neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

Download newest data for existing query #1

Closed: RaphaelMeudec closed this issue 2 years ago

RaphaelMeudec commented 2 years ago

Hello,

I'm opening this issue because hashing the query and the vocabulary might not be ideal when a query hits data from the current year.

Query hash

Using the query hash to avoid re-downloading data is a nice touch, but it can be a problem for users who want to download the latest data.

For example, a query like 'fMRI[Abstract]' run again after a 30-day interval would not download the data from the last 30 days. The only option for the user to update the data is to delete the existing folder for the query, which is not ideal either I guess. Maybe a --force option would be nice to force a re-download.
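To make the behaviour concrete, here is a rough sketch of hash-based caching, not the tool's actual code: the directory naming and the --force flag are just placeholders. The point is that the same query always resolves to the same directory, so a later run reuses the old results.

```python
import hashlib
from pathlib import Path


def query_directory(data_dir: Path, query: str) -> Path:
    # The directory name depends only on the query text, so running the
    # same query weeks later resolves to the same (already populated) directory.
    return data_dir / f"query_{hashlib.md5(query.encode('utf-8')).hexdigest()}"


def download(data_dir: Path, query: str, force: bool = False) -> Path:
    out_dir = query_directory(data_dir, query)
    if out_dir.exists() and not force:
        # Existing results are reused; articles published since the first
        # run are never fetched.
        print(f"{out_dir} already exists, skipping download")
        return out_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    # ... the actual PubMed Central download would happen here ...
    return out_dir
```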

Vocabulary hash

For subsequent steps such as vectorization, whether the step is re-run depends only on the vocabulary hash. I guess this is not ideal either: if a user downloads more data (either through a force option as mentioned above, or by merging data directories by hand), the processed outputs would still point to the older data.

Proposition

The option I have in mind to solve this is:

jeromedockes commented 2 years ago

Thanks!! I think it's a good idea; I'm just slightly worried about the additional complexity.

The goal of these hashes was to provide unique but stable names for different queries and vocabularies rather than checking integrity.

Cache invalidation is hard, so to keep things simple for now my idea was not to dive into it. If a step has completed successfully, nqdc will not modify its output. So if it changes, it is because the user has manually changed it; in that case I would leave it up to the user to also remove outputs that have been invalidated. In any case, even if we checked hashes, the user would still need to remember to re-run the necessary pipelines (e.g. for each different vocabulary they used) before using the final products.
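As a rough sketch of what I mean by "a completed step's output is never modified" (the info.json file and is_complete field below are placeholders, not necessarily what nqdc actually writes):

```python
import json
from pathlib import Path


def step_is_complete(step_dir: Path) -> bool:
    # Hypothetical completion marker; the real tool may use a different
    # file name or format.
    info_file = step_dir / "info.json"
    if not info_file.is_file():
        return False
    return json.loads(info_file.read_text()).get("is_complete", False)


def run_step(step_dir: Path) -> None:
    if step_is_complete(step_dir):
        # The output of a successfully completed step is left untouched;
        # invalidating it (e.g. after adding data) is up to the user.
        return
    step_dir.mkdir(parents=True, exist_ok=True)
    # ... do the actual work (download, extraction, vectorization, ...) ...
    (step_dir / "info.json").write_text(json.dumps({"is_complete": True}))
```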

Basically, at least for now, we would just aim for a coherent state when the data has only been modified by calls to nqdc. If a user modifies it manually, they take responsibility for managing that themselves. If a user wants new articles that have appeared on PMC since the download, they delete the query directory and re-run the pipeline.

I agree the documentation does not explain this clearly enough.

WDYT?

jeromedockes commented 2 years ago

I think this is now addressed in the documentation, which states:

If we run the same query again, only missing batches will be downloaded. If we want to force re-running the search and downloading the whole data we need to remove the articlesets directory.

(that may not be obvious or highlighted enough though :/)
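Concretely, the reset would look something like this. The directory layout follows the sketch earlier in this thread and the command in the comment is only an example, so treat the exact names as assumptions rather than an exact recipe.

```python
import hashlib
import shutil
from pathlib import Path

# Assuming the query directory is named from a hash of the query,
# as in the earlier sketch.
query = "fMRI[Abstract]"
query_dir = Path("pubget_data") / f"query_{hashlib.md5(query.encode('utf-8')).hexdigest()}"

# Removing the articlesets directory forces the next run to re-execute the
# search and download everything, including recently published articles.
articlesets = query_dir / "articlesets"
if articlesets.exists():
    shutil.rmtree(articlesets)

# Then re-run the pipeline, e.g. something like:
#   pubget run pubget_data -q "fMRI[Abstract]"
```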

Therefore, unless there are new comments, I will close this issue in a few days.