plazi / treatmentBank

Repository devoted to house keeping of treatmentBank
0 stars 0 forks source link

HTTP downloads don't support If-Modified-Since #32

Open MattBlissett opened 2 years ago

MattBlissett commented 2 years ago

GBIF use an HTTP If-Modified-Since request to avoid reprocessing data that hasn't changed:

curl -Ssv --remote-time --output clements.zip --time-cond clements.zip https://hosted-datasets.gbif.org/datasets/clements.zip

(Run this twice, and the second time the response is HTTP/1.1 304 Not Modified, and the file isn't downloaded.)

It would be very useful if the Plazi TB server could support this, otherwise it causes problems when thousands of datasets are reprocessed -- updates to the dataset trigger further updates, and we end up in a loop.

curl -Ssv --remote-time --output 463.zip --time-cond 463.zip http://tb.plazi.org/GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip (Gives HTTP/1.1 200 every time.)

gsautter commented 2 years ago

@MattBlissett I see ... will add a hash based check to make sure update notifications only ever go out if there's an actual modification.

gsautter commented 2 years ago

@MattBlissett added an MD5 based catch now ... meaning to say that if the actual DwC-A doesn't change as a result of a re-export, there won't be any notifications to your API anymore, and the last modified timestamp at least isn't updated.

Sorry for any inconvenience.

gsautter commented 2 years ago

@MattBlissett regarding the curl command, could you use a -vvv execution to get the headers this sets? More than happy to add support for those headers in the DwC-A provider servlet.

MattBlissett commented 2 years ago

For curl, this URL behaves as I'd like (since it's just Apache, which supports If-Modified-Since). I've cut out the DNS, TCP and irrelevant headers, and found a non-HTTPS URL:

# Download the file and with --remote-time set the file's modification time to the Last-Modified time returned by the server:
$ curl -Ssv --remote-time --output data.zip http://download.gbif.org/mpodolskiy/dataset-with-hosts_20211013.zip
...
< HTTP/1.1 200 OK
< Last-Modified: Wed, 13 Oct 2021 08:14:28 GMT
...
$ TZ=GMT ls -l data.zip
-rw-r--r-- 1 matt blissett 59,124,838 2021-10-13 08:14 data.zip

# Then repeat, but use --time-cond to make Curl send a conditional (If-Modified-Since) request, based on the timestamp of data.zip:
$ curl -Ssv --remote-time --output data.zip --time-cond data.zip http://download.gbif.org/mpodolskiy/dataset-with-hosts_20211013.zip
...
> GET /mpodolskiy/dataset-with-hosts_20211013.zip HTTP/1.1
> If-Modified-Since: Wed, 13 Oct 2021 08:14:28 GMT
...
< HTTP/1.1 304 Not Modified
# and nothing was downloaded.

# If the file were too old (fake it with touch -d '2000-01-01 00:00:00' data.zip) then
# the second curl command would overwrite data.zip as usual, with a 200 response.

We have this implementation in the IPT: https://github.com/gbif/ipt/blob/master/src/main/java/org/gbif/ipt/action/portal/ResourceFileAction.java#L73-L79

gsautter commented 2 years ago

Well, on our setup, you only see Apache, as we use it as a proxy, but safe for very few things, it really only is a proxy for a Tomcat running behind it, so I'll have to support the If-Modified-Since header in Tomcat ...

Shouldn't be a problem, though, as Java is pretty good at parsing dates of all sorts of formats ...

gsautter commented 2 years ago

Would it be helpful to include the actual modification time? I.e., setting the message to "Last Modified " instead of plainly "Not Modified"? Status code remains 304, of course.

MattBlissett commented 2 years ago

We'll just use the 304 status code, but the specification says adding Last-Modified "might be useful", so it may as well be included: https://httpwg.org/specs/rfc7232.html#status.304

gsautter commented 2 years ago

OK, happy to include it ... and done ... please try any DwC-A of your choice.

gsautter commented 2 years ago

Couldn't seem to the the Last-Modified header through with the 304 response ... however, after some playing around, turns out that said header comes out of the Tomcat just fine ... but gets filtered out by the proxying Apache ... something new to Google ...

gsautter commented 2 years ago

After two more hours of Googling and digging through Apache config, there doesn't seem to be a way to tell Apache to relay the Last-Modified header in a 304 response ... sorry.

gsautter commented 2 years ago

And here comes the problem with the hash based approach ... since the meta files contain export timestamps, the hash will be different every single time ... need to think of a different approach, or filter the hash somehow.

MattBlissett commented 2 years ago

The 304 response is enough for us, there's no need to rethink the hashing.

I already see "Not Modified" records in our logs and registry, e.g. https://registry.gbif.org/dataset/1e4ab867-cf67-4105-9918-ebb52be7f061/crawling-history

Thanks very much,

gsautter commented 2 years ago

The 304 response is enough for us, there's no need to rethink the hashing.

Well, thank you, but I do need the hash to decide whether or not to replace the DwC-A, and replacing it always also updates the modification time ... need not only consider re-exports due to data changes, but also ones incurred by changes to metadata formatting (like the current big one) or TSV structure, so the hash has to cover more than the source treatments alone.

gsautter commented 2 years ago

And the next complication ... Java's ZIP facilities actually set the modification time for ZIP file entries to "right now" unless some other timestamp is provided ... do you make use of the modification timestamps of the individual TSVs in any kind of way?

gsautter commented 2 years ago

Finally ... managed to hash around everything but content data, and also around any chunks of content data that involve the export time on one form or another ... and now the hashes seem to be stable, so no more redundant updates, not even to the modification time.

That said, sorry for the update flurry on 93F443DCBCEDFBF26165C392E1E8901C ... had to test if the archive is really stable.

MattBlissett commented 2 years ago

Thanks again -- this was working, but seems to have stopped working at some point this month. I see a '200' response every time now.

gsautter commented 2 years ago

Well, the only changes that happened were (a) an upgrade of the JRE our Tomcat runs in and (b) the addition of a CORS filter for one specific sub-path that has nothing to do with delivering the DwC-As ...

In addition, the hash based change tracking I also added back in February is still in place, so to prevent poking your API unless a DwC-A actually changed, and that does seem to be working properly.

I did, however, make a few changes to the DwC-A packing routine late last week (to properly include cited taxa, which are presumably synonyms), and repacked quite a number of DwC-As after that ... so a good few were actually modified.

Which DwC-A did you see the change in behavior on? Please let me know, then I'll track down the problem.

MattBlissett commented 2 years ago

I think it's affecting all datasets, though I have only tried a few. I'm testing with curl (as above):

curl -Ssv --remote-time --output test.zip --time-cond test.zip http://tb.plazi.org/GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip

Running this a second time shows a 200 response, when I'd expect a 304.

curl -Ssv --remote-time --output test.zip --time-cond test.zip http://tb.plazi.org/GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip
> GET /GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip HTTP/1.1
> Host: tb.plazi.org
> User-Agent: curl/7.74.0
> Accept: */*
> If-Modified-Since: Sun, 30 Jan 2022 20:50:57 GMT
>
< HTTP/1.1 200
< Date: Mon, 11 Jul 2022 17:08:58 GMT
< Server: Apache/2.4.53 (Debian)
< Last-Modified: Sun, 30 Jan 2022 20:50:57 GMT
< Content-Type: application/zip
< Content-Length: 12309
< Via: 1.1 srv1.plazi.org
< Access-Control-Allow-Origin: *
gsautter commented 2 years ago

Reproduced ... looking into it.

gsautter commented 2 years ago

One thing I can say for sure ... I did not modify the code checking the file modification time against the If-Modified-Since header ... must be something else, maybe blocking out that header ...

gsautter commented 2 years ago

@MattBlissett by any chance, do you have a recollection as to when it stopped working, or when you first observed it?

MattBlissett commented 2 years ago

I think you may have just fixed it as the curl command above now works. (Thanks!)

I did wonder if you have something odd with time zone settings. I have timestamps (from the Last-Modified header) that are exactly 8 hours changed, but the file is identical:

21,309 2022-02-03 04:16 ece8c67d-d18a-4047-8a74-b2974c293214.233.dwca
21,309 2022-02-02 20:16 ece8c67d-d18a-4047-8a74-b2974c293214.250.dwca

29,109 2022-05-03 15:03 e9be13b4-b13b-458e-af94-7f9118030ec7.215.dwca
29,109 2022-05-03 08:03 e9be13b4-b13b-458e-af94-7f9118030ec7.222.dwca
gsautter commented 2 years ago

Actually, I didn't change anything at all so far, was merely sifting through the code involved, looking for potential sources of the problem, so I could exclude them one by one and narrow down the problem ...

gsautter commented 2 years ago

Looks like the CORS filter causes the issue ... with the If-Modified-Since header listed as allowed, the timestamps are fine, but omitting If-Modified-Since in that list results in the timestamps being off by 8 hours ... and that even though said filter is configured to apply only to one specific path under http://tb.plazi.org/GgServer/, which is not http://tb.plazi.org/GgServer/dwca ... Google doesn't seem to know anything about the issue, and the CORS filter documentation doesn't say anything, either, but I tried it several time back and forth, and it looks like causality ... going to try and mess with this a bit more, but now at least we seem to have a working setup ...

MattBlissett commented 1 year ago

Just noting that I no longer see the 304 Not Modified responses. (I have a Plazi URL in an integration test.)

Initial download is fine and includes a Last-Modified header:

gbif-httputils > curl -Ssv --remote-time --output test.zip --time-cond test.zip http://tb.plazi.org/GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip
...
>
< HTTP/1.1 200
< Date: Thu, 01 Dec 2022 21:49:18 GMT
< Server: Apache
< Last-Modified: Sun, 30 Jan 2022 20:50:57 GMT
...

gbif-httputils > ls -l test.zip
-rw-r--r-- 1 matt blissett 12,309 2022-01-30 21:50 test.zip

Subsequent download with a correct If-Modified-Since request header gives another 200 response, when we'd prefer a 304 Not Modified:

gbif-httputils > curl -Ssv --remote-time --output test.zip --time-cond test.zip http://tb.plazi.org/GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip
> GET /GgServer/dwca/4636575DFFA7FFEDFFFBFFE4FA06EE20.zip HTTP/1.1
> Host: tb.plazi.org
> User-Agent: curl/7.68.0
> Accept: */*
> If-Modified-Since: Sun, 30 Jan 2022 20:50:57 GMT
>
< HTTP/1.1 200
< Date: Thu, 01 Dec 2022 21:49:23 GMT
< Server: Apache
< Last-Modified: Sun, 30 Jan 2022 20:50:57 GMT
...
gsautter commented 1 year ago

@MattBlissett just tested it, and was able to reproduce the behavior ... for some weird reason, on the download the backing servlet sets the Last-Modified response header in GMT-0800 rather than plain UTC ... and the Apache proxy apparently strips the correct timezone info and appends GMT instead, indicating modification 8 hours before the actual time, so the downloaded zip get an older modification timestamp than the actual one ... and that too old timestamp then goes back in the If-Modified-Since header of the next request and is compared to the actual file timestamp, which is 8 hours later ... need to do some more digging to figure out what causes the timezone mix-up.

MattBlissett commented 1 year ago

I set all our servers to UTC just to avoid this type of problem, but I have noticed some Docker containers (etc) have had their default set to California time. export TZ=UTC or docker ... -e TZ=UTC ... might be an easy fix.

gsautter commented 1 year ago

Found the problem ... the DateFormat object used to parse the If-Modified-Since timestamps remembers the last timezone used, and that same DateFormal was used for formatting the Last-Modified timestamps as well ... and the Apache proxy somehow adjusted the time zone ...

Anyway, fixed now. Please let me know if you observe any more anomalies.