postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0
5.42k stars 445 forks source link

feat: update all fixtures and custom parsers to match #713

Closed sdoire closed 1 year ago

sdoire commented 1 year ago

This PR includes the majority of changes from the original move-fixtures PR. It includes everything but the changes to the scripts, which will be done in a future PR. From the original PR:

"This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. This has been changed so that the filename indicates the domain and the modified time of the file indicates how recently it was fetched. A fixture's filename can optionally include a modifier to distinguish between two different page types on the same domain, for example.

Finally, all fixtures have been updated."

In addition, the updating of fixtures exposed that a number of custom extractors are now out of sync with the current website layouts for their domains, so all of those have been updated so that the custom extractor unit tests are now passing again.

flbn commented 1 year ago

wondering how i can help you close this ticket so we can move onto the second half of the PR?

sdoire commented 1 year ago

@flbn Thanks for finding that issue with the .DS_Store! I could use a second set of eyes to do a review of the PR and do a formal approval (if it looks OK to you, of course). I know it's a lot to review, but it's mostly just updated fixtures, alongside updates of custom parsers and their related unit tests. If you have the bandwidth to take a review on, that would be amazing, but I totally get it if you have too much other stuff going on!

flbn commented 1 year ago

can definitely give it a look over!

flbn commented 1 year ago

heck yea, i didn't know how to get those files to update (build:generator) and it looks like it needed a bit of tweaking to generate the correct build... i ended up deleting the comment where i mentioned the extractor didn't match the custom generator output so i'm not sure if that's what inspired the commit but i ended up finding the correct file to look in later so i deleted that original comment, guess that mistake was kinda lucky 😅 😅

sdoire commented 1 year ago

@flbn I did see your comment before it was deleted, which made me curious as to how that file does get updated. Found it after a little digging around. 😸

flbn commented 1 year ago

the name of the file fixtures/www.washingtonpost.com/1480364838420.html was changed to fixtures/www.vulture.com--content-test.html, but that file seems to never be used?

flbn commented 1 year ago

all references to deleted files were removed 👍

flbn commented 1 year ago

all references to files that were renamed are correctly referenced to 👍

flbn commented 1 year ago

all files added are referenced in their respective extractors/tests 👍

sdoire commented 1 year ago

the name of the file fixtures/www.washingtonpost.com/1480364838420.html was changed to fixtures/www.vulture.com--content-test.html, but that file seems to never be used?

This is showing up as being changed to fixtures/www.washingtonpost.com--other.html for me. fixtures/www.vulture.com--content-test.html is being used for a unit test in score-content.test.js.

Screen Shot 2022-12-12 at 1 38 59 PM

flbn commented 1 year ago

This is showing up as being changed to fixtures/www.washingtonpost.com--other.html for me. fixtures/www.vulture.com--content-test.html is being used for a unit test in score-content.test.js.

ahh yes, that's what i meant! the file fixtures/www.washingtonpost.com--other.html is created but i couldn't find it used anywhere (like you mentioned, a similar file fixtures/www.vulture.com--content-test.html is used for a unit test, but not the washington post alternative)

sdoire commented 1 year ago

This is showing up as being changed to fixtures/www.washingtonpost.com--other.html for me. fixtures/www.vulture.com--content-test.html is being used for a unit test in score-content.test.js.

ahh yes, that's what i meant! the file fixtures/www.washingtonpost.com--other.html is created but i couldn't find it used anywhere (like you mentioned, a similar file fixtures/www.vulture.com--content-test.html is used for a unit test, but not the washington post alternative)

Ah, I see now! That unused fixture has been removed. Thanks!