Inclusion date - Githubissues

trvrb commented 8 years ago

I've realized it would be totally helpful to have a inclusion_date field in addition to the collection_date field in vdb. As a simple use case, I just uploaded new Zika sequences from ViPR and in doing so all the timestamps we set to present and it wasn't at all easy to find what was new. It would have been super handy to be able to sort by inclusion_date. This would also allow us to roll about the visualization and see what data was available when.

My proposal would be just to use whatever date it is when a document is first added to the database. This should be for appearance in our database, not when the sequence first appeared in GenBank.

An eventual goal is to be able to have a nextstrain.org/updates/ page that would list atomic updates to the app and link to new viruses added in each update. This could pretty easily be done by passing the inclusion_date to augur/auspice from vdb.

chacalle commented 8 years ago

I think there is a bug with updating the timestamp field because it should just be the last time the document was updated or inserted. I'll open up a separate issue for that.

chacalle commented 8 years ago

So the inclusion_date is just the very first time the document is added to the database. This would be formatted like collection_date as YYYY-MM-DD.

trvrb commented 8 years ago

Yes. Exactly. inclusion_date is first time a document appears and is formatted just like collection_date.

chacalle commented 8 years ago

Another thing to consider, each virus document and sequence document will have an inclusion_date. When downloading we merge together the virus document and sequence document with this command.

command = r.table(sequence_table).merge(lambda sequence: r.table(virus_table).get(sequence[index]))

If the virus and sequence document have different inclusion_date value then rethinkdbs merge command defaults to the rightmost document in the merge command which would keep the virus inclusion_date. I think it makes more sense to keep the sequence inclusion_date since there might be multiple sequences per virus but this would require some work on the download side to adjust the merge command.

trvrb commented 8 years ago

Hmm.... I think I'm okay with attaching inclusion_date to virus when downloading FASTAs. As a concrete example, we'll often want to know something like when did A/HongKong/4801/2014 first appear in the database. More sequences can appear later, but that's not the main interest.

I do like having an inclusion_date for each sequence and an inclusion_date for each virus in the table. This just becomes a question of how to the merge when downloading.

chacalle commented 8 years ago

Okay so say: A/HongKong/4801/2014 has an inclusion_date 2014-04-01 The first sequence uploaded with it, EPI1 also has an inclusion_date 2014-04-01 A second sequence uploaded, EPI2 has a later inclusion_date 2014-08-31

Right now the command above would download EPI1 and EPI2 and they would both keep A/HongKong/4801/2014's inclusion_date 2014-04-01.

But this seems to be okay because we care more about when the virus is first uploaded. Both the sequence and virus will have inclusion_date field though.

trvrb commented 8 years ago

Hmm.... I see. Thinking more, what if we had virus_inclusion_date and sequence_inclusion_date fields. The merge could include one or both in the resulting FASTA. Seems a bit cleaner perhaps. What do you think?

chacalle commented 8 years ago

I like that! They'd both be left after the merge and can be downloaded to the resulting fasta if needed.

trvrb commented 8 years ago

Exactly. I like it.

chacalle commented 8 years ago

I believe this works now. I also added the fields to current documents in vdb and tdb defaulted to 2016-09-03. Also reminder that the inclusion_date and timestamp fields are based off utc time.

trvrb commented 8 years ago

Fantastic! Thanks so much for making this happen @chacalle.

nextstrain / fauna

Inclusion date #35