iaincollins / structured-data-testing-tool

A library and command line tool to help inspect and test for Structured Data.
https://www.npmjs.com/package/structured-data-testing-tool
ISC License

Rewrite and collaboration #16

Open raffaelj opened 4 years ago

raffaelj commented 4 years ago

@iaincollins A week ago I searched (again) for resources to validate HTML markup and metadata. As always, the most common answer was to use Google's Structured Data Testing Tool, which might be a workaround, but it is a bad answer. So I searched through npm packages to find existing tools or good approaches, and I found your tool.

Well, I already mentioned in #13 that I'm not the biggest fan of your coding style. I apologise if I insulted you with "Your code is a mess." After looking in a dictionary, I discovered it has some more meanings than just "not well structured and really hard to read"...

What I've done

I rewrote your tool and the web-auto-extractor in an object-oriented style. It is readable, modular and easy to extend. It has two entry points, one for browser and one for CLI usage, and I covered most of your existing features. It needs some cleanup here and there, the schema handling isn't fully implemented and it only works with URLs for the moment. It shouldn't be much work to add the missing features.
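
Roughly, the structure looks like this (a heavily simplified sketch; the names are illustrative, not the actual code):

```js
// Shared core used by both entry points (simplified illustration).
class Validator {
  constructor(options = {}) {
    this.options = options;
  }

  // Fetch a URL and validate its markup (fetch is global in browsers
  // and in Node 18+).
  async validateUrl(url) {
    const res = await fetch(url);
    return this.validateHtml(await res.text());
  }

  validateHtml(html) {
    // ...parse JSON-LD / microdata / RDFa and collect results...
    return { errors: [], warnings: [] };
  }
}

// cli.js entry point:
//   new Validator().validateUrl(process.argv[2]).then(console.log);
// browser.js entry point:
//   window.SeoMetaValidator = Validator;
```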

I published it today as an alpha release: https://github.com/raffaelj/seo-meta-validator. See README.md and the code comments for more details, or ask me anything you want to know about it.

What I want

My main goal is to validate websites on localhost before I publish them, and for that I need a browser-based UI. I'm also thinking about splitting the project into multiple packages/repos, e.g. a core, a CLI tool and a browser-based test suite.

What do you think about a complete rewrite?

It doesn't have to match my style, but I really need some structure to be able to contribute.

Are you interested in a collaboration?

I'm very interested in your knowledge about data validation and I'm not familiar with publishing npm packages. I also like your style of writing detailed issues and documentation.

iaincollins commented 4 years ago

Hi there!

I have to pop out now, but as I was commenting in another issue just thought I'd quickly say I've seen this and will follow up. Thanks for the detail!

tl;dr very much up for collaboration. I think trying to build a tool that runs in a web browser might add some additional challenges (including around testing), but it's an interesting idea!

I've been doing some work to add schema property validation and am tempted to hold off on a big rewrite until that is done, as I think problems with the design will shake out as that work progresses and inform a few things. I've had that hunch for a while, which is why I haven't refactored much recently.

I'd also like to improve the test coverage before I start changing a lot, as it's super easy for new bugs to creep in. That said, I'm happy to move blocks of code around in the shorter term, including splitting functions into dedicated files (classes, etc.) and making things more reusable!

The repo looks interesting, will check it out properly when I'm back!

raffaelj commented 4 years ago

Sorry for the late reply. I needed a break after working a bit too much.

> I've been doing some work to add schema property validation and am tempted to hold off a big re-write until that is done...

Totally understandable.

I will do some cleanup and improve my tool in the next few days, and I have to think some more about the usability of my approach. I don't want to act precipitately. Let's keep in touch.

raffaelj commented 4 years ago

I did some research, a lot of cleanup and restructuring, and I worked on the schema validation. Maybe you are interested in my progress, or it gives you some inspiration.

I found a way to check for valid properties, and the data type validation for schema properties is partially implemented. I needed multiple recursive functions, so I hope it is readable. It is still a bit messy and needs some more restructuring, but it feels like I'm on the right track.
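
The idea, roughly sketched (simplified from my actual code, and the schema table here is just an illustrative stand-in):

```js
// Walk a parsed structured data object recursively and flag properties
// that are not defined for the node's @type (simplified illustration).
function validateProperties(data, schemas, path = [], problems = []) {
  const schema = schemas[data['@type']];
  if (!schema) return problems; // unknown type: nothing to check against
  for (const [key, value] of Object.entries(data)) {
    if (key.startsWith('@')) continue; // skip @type, @context, etc.
    if (!schema.properties.includes(key)) {
      problems.push({
        path: [...path, key].join('.'),
        message: `unknown property on ${data['@type']}`,
      });
    }
    // Recurse into nested typed objects, e.g. author -> Thing
    if (value && typeof value === 'object' && !Array.isArray(value)) {
      validateProperties(value, schemas, [...path, key], problems);
    }
  }
  return problems;
}

// Example with a tiny hand-written schema table:
const schemas = {
  Article: { properties: ['headline', 'author'] },
  Thing: { properties: ['name'] },
};
validateProperties(
  { '@type': 'Article', headline: 'Hi', author: { '@type': 'Thing', nmae: 'A. Smith' } },
  schemas
);
// -> [ { path: 'author.nmae', message: 'unknown property on Thing' } ]
```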

I also figured out that some validations must fail due to wrong results from the metadata parsers. The web-auto-extractor (and my slightly improved parsers) are too simple.

An example:

"author": "A. Smith" should be parsed as "author": {"@type":"Thing", "name":"A. Smith"}

If you copy your test files into Google's Structured Data Testing Tool, you'll see that it parses them that way. With correctly parsed data that fills in defaults and maps some strings to typed objects, it is easy to check the values against expected data types.
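
A normalisation pass over the parsed data could fix this before validation; a minimal sketch, assuming a hand-maintained map of property names to default types (the map entries here are examples only, far from complete):

```js
// Properties that Google's tool expands from plain strings into typed
// objects (illustrative entries only).
const DEFAULT_TYPES = { author: 'Thing', publisher: 'Thing' };

function normalize(data) {
  for (const [key, value] of Object.entries(data)) {
    if (typeof value === 'string' && DEFAULT_TYPES[key]) {
      // "author": "A. Smith" -> "author": { "@type": "Thing", "name": "A. Smith" }
      data[key] = { '@type': DEFAULT_TYPES[key], name: value };
    } else if (value && typeof value === 'object') {
      normalize(value); // recurse into nested objects and arrays
    }
  }
  return data;
}
```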

iaincollins commented 4 years ago

I thought I'd share my current status!

It looks like work on 5.0 is coming along fine.

The work in progress has a list of all valid schema properties and all draft schema properties; it is able to identify and distinguish between the two (generating warnings, but not errors, for draft properties), and it also generates errors when invalid schema properties are found.
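
Roughly, the classification works like this (a simplified sketch of the idea, not the actual code; the property lists here are just examples):

```js
// Released schema.org properties vs. draft (pending) ones
// (illustrative entries only; the real lists are much larger).
const VALID_PROPS = new Set(['name', 'author', 'headline']);
const DRAFT_PROPS = new Set(['archivedAt']);

function classifyProperty(prop) {
  if (VALID_PROPS.has(prop)) return { level: 'ok' };
  if (DRAFT_PROPS.has(prop))
    return { level: 'warning', message: `${prop} is a draft schema.org property` };
  return { level: 'error', message: `${prop} is not a valid schema.org property` };
}
```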

It is not yet able to do this for nested schema properties, which is a crucial feature that I think needs to be in 5.0, so I am continuing work on it.

I think I do need to do a bit of code cleanup before I can go beyond that to actually validating the values of properties (e.g. Text, Date, etc.). My thinking is that actual validation of values is likely to land in 5.1 (or some other subsequent 5.x release).
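
To give a feel for what that could look like, here is a sketch of value validation against schema.org primitive types (just an illustration of the idea, not committed API):

```js
// Checks for a few schema.org primitive types (illustrative subset).
const TYPE_CHECKS = {
  Text: (v) => typeof v === 'string',
  Date: (v) => typeof v === 'string' && !Number.isNaN(Date.parse(v)),
  URL: (v) => { try { new URL(v); return true; } catch { return false; } },
};

// Returns true if the value matches the expected type; unknown types
// pass for now rather than producing false errors.
function validateValue(value, expectedType) {
  const check = TYPE_CHECKS[expectedType];
  return check ? check(value) : true;
}
```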

I think 6.x will probably mean rewriting all the unit tests - there are a lot of cases not catered for, and the current tests cover a bunch of functionality in a non-explicit way and can be hard to debug - and then, when the test coverage is in a better state, starting to refactor the codebase in a more significant way.

I think this approach will likely uncover quite a few bugs that are being missed in the interim, but it is also the fastest way to drive improvement. Right now lots of common scenarios work really well and most of the bugs are in edge cases - such as sites like https://www.nytimes.com that use some more advanced approaches - so I think there is value in doing it this way.

Having other approaches as a reference is very useful, I think, and I'm happy to make things more modular (e.g. so it's easier to target running in a browser).

I'm still on a journey with this codebase, to see if the approach can be refined without scrapping everything, but I think after 5.1 I will be in a good place to consider whether it has turned out okay or not (i.e. whether it should be scrapped). A full rewrite might be necessary if it turns out it's all too convoluted, but I suspect maintaining feature parity in a rewrite will be quite tricky.

My biggest concern right now is probably the test coverage (both the amount and the quality of it). The code is relatively easy to refactor (although I'm sure it's slow going for anyone else, given how hairy it is), but the tests not being easy to follow or comprehensive is a problem, as it makes refactoring somewhat risky right now.

Will keep you updated! Interested to hear how you are getting on!