postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0
5.45k stars, 446 forks

Time zone difference causes tests to fail #549

Open Buratinator opened 4 years ago

Buratinator commented 4 years ago

Expected Behavior

I was expecting that date/time would be converted into the same value regardless of where the test is run.

Current Behavior

In local tests, the date is read as midnight in my local time zone and then converted to UTC. I'm in UTC+3, so 3 hours are subtracted from the date.

In the automated tests that run when I submit a PR, that same date is treated as a true UTC+0 date.

Thus, this HTML:

    <div itemprop="datePublished" class="publication-date">
      <span class="publication-day">Apr 6</span>
      <span class="publication-year">2020</span>
    </div>

...with this date extractor:

  date_published: {
    selectors: [
      // enter selectors
      ".publication-date[itemprop='datePublished']",
    ],
  },

generates this on my local machine: 2020-04-05T21:00:00.000Z

But Postlight's CircleCI tests, which run automatically at PR submission, generate 2020-04-06T00:00:00.000Z

(https://circleci.com/gh/postlight/mercury-parser/4216?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link)
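The 3-hour gap between the two results can be reproduced with plain date arithmetic. The sketch below is illustrative only: `localMidnightToUtc` is a hypothetical helper, not part of the parser, and it assumes the page's date is read as midnight in a zone with a fixed UTC offset.

```javascript
// Hypothetical helper: the UTC instant of local midnight for a zone
// that sits utcOffsetHours ahead of UTC.
function localMidnightToUtc(year, monthIndex, day, utcOffsetHours) {
  // Date.UTC takes a zero-based month index, so April is 3.
  const utcMidnight = Date.UTC(year, monthIndex, day);
  // A zone ahead of UTC reaches midnight earlier, so subtract the offset.
  return new Date(utcMidnight - utcOffsetHours * 60 * 60 * 1000).toISOString();
}

console.log(localMidnightToUtc(2020, 3, 6, 3)); // 2020-04-05T21:00:00.000Z (UTC+3 machine)
console.log(localMidnightToUtc(2020, 3, 6, 0)); // 2020-04-06T00:00:00.000Z (UTC+0 CI)
```

Both values describe "Apr 6 2020, 00:00" on a wall clock; they differ only in which zone that wall clock is assumed to be in.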

Being new to all this, I don't see a way to pass tests both on my local computer and upon the PR submission. I don't want to have to fudge with this. Is there a setting or environment variable to ensure uniform treatment of dates?

Steps to Reproduce

The page is at http://med.stanford.edu/news/all-news/2020/04/smart-toilet-monitors-for-signs-of-disease.html (I do apologize for the topic of that article lol).

Shepard commented 2 years ago

I ran into the same problem and found the following solution:

You can supply a timezone option to the date_published field in your extractor. In my case it looks like this:

  date_published: {
    selectors: [
      '.content__meta__date'
    ],
    timezone: 'Europe/Berlin'
  },

This should make the extractor return the same fixed result both on your machine and in the CI environment.
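To illustrate why a named timezone gives a machine-independent result, here is a rough sketch using only the built-in `Intl` API. It is not how the parser implements the `timezone` option; `wallClockAsUtcMillis` and `midnightInNamedZone` are hypothetical names, and the sketch ignores edge cases around daylight-saving transitions.

```javascript
// Read the wall clock shown in `timeZone` at a given instant and
// re-encode those wall-clock fields as if they were UTC.
function wallClockAsUtcMillis(timeZone, utcMillis) {
  const formatter = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hourCycle: 'h23',
    year: 'numeric', month: '2-digit', day: '2-digit',
    hour: '2-digit', minute: '2-digit', second: '2-digit',
  });
  const fields = {};
  for (const { type, value } of formatter.formatToParts(new Date(utcMillis))) {
    fields[type] = Number(value); // separator parts become NaN and are never read
  }
  return Date.UTC(fields.year, fields.month - 1, fields.day,
                  fields.hour, fields.minute, fields.second);
}

// UTC instant of 00:00 on the given calendar day in the named zone.
function midnightInNamedZone(year, monthIndex, day, timeZone) {
  const guess = Date.UTC(year, monthIndex, day);
  // The zone's offset from UTC at (roughly) that moment, in milliseconds.
  const offset = wallClockAsUtcMillis(timeZone, guess) - guess;
  return new Date(guess - offset).toISOString();
}

// Berlin is on summer time (UTC+2) in April, so midnight falls at 22:00 UTC.
console.log(midnightInNamedZone(2020, 3, 6, 'Europe/Berlin')); // 2020-04-05T22:00:00.000Z
```

Because the zone is named explicitly, the result does not depend on the timezone of the machine running the code.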

Originally I was hoping I could inject this option during the test somehow so that I could restrict it as a workaround for the test only. But then I realised that it actually makes sense to keep it in the extractor in my case: the extractor is for a German website, so the dates shown on the articles on the website should be interpreted according to Germany's timezone.

Now this might not make sense for all websites, especially not for international ones, and it's a bit annoying having to supply it in all extractors. So having a way to set a fixed timezone in tests (or for the test setup to do this for us) would still be nicer. Alternatively (or additionally) I'd like to see the documentation for custom extractors updated to describe the timezone and format options of the date_published extractor and potentially warn about such problems in tests.
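One possible way to fix the timezone for tests only is sketched below. It assumes the tests run under a recent Node.js that honours the `TZ` environment variable and applies changes made at runtime; whether the project's test setup offers a hook for this is not confirmed here, so treat the placement as hypothetical.

```javascript
// Hypothetical test setup code: this has to run before any test
// creates a Date, so that every local-time interpretation uses UTC.
process.env.TZ = 'UTC';

// With the process timezone pinned to UTC, local midnight and UTC
// midnight coincide, on a developer machine and on CI alike.
const published = new Date(2020, 3, 6); // local midnight, Apr 6 2020
console.log(published.toISOString());
```

Setting `TZ` in the shell before launching the test command should have the same effect without touching any code. Neither approach replaces the `timezone` option for sites whose dates belong to a specific zone.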