Closed trvrb closed 8 years ago
I think there is a bug with updating the timestamp
field because it should just be the last time the document was updated or inserted. I'll open up a separate issue for that.
So the inclusion_date
is just the very first time the document is added to the database. This would be formatted like collection_date
as YYYY-MM-DD
.
Yes. Exactly. inclusion_date
is first time a document appears and is formatted just like collection_date
.
Another thing to consider, each virus document and sequence document will have an inclusion_date
. When downloading we merge together the virus document and sequence document with this command.
command = r.table(sequence_table).merge(lambda sequence: r.table(virus_table).get(sequence[index]))
If the virus and sequence document have different inclusion_date
value then rethinkdbs merge
command defaults to the rightmost document in the merge command which would keep the virus inclusion_date
. I think it makes more sense to keep the sequence inclusion_date
since there might be multiple sequences per virus but this would require some work on the download side to adjust the merge command.
Hmm.... I think I'm okay with attaching inclusion_date
to virus when downloading FASTAs. As a concrete example, we'll often want to know something like when did A/HongKong/4801/2014
first appear in the database. More sequences can appear later, but that's not the main interest.
I do like having an inclusion_date
for each sequence and an inclusion_date
for each virus in the table. This just becomes a question of how to the merge when downloading.
Okay so say:
A/HongKong/4801/2014
has an inclusion_date
2014-04-01
The first sequence uploaded with it, EPI1
also has an inclusion_date
2014-04-01
A second sequence uploaded, EPI2
has a later inclusion_date
2014-08-31
Right now the command above would download EPI1
and EPI2
and they would both keep A/HongKong/4801/2014
's inclusion_date
2014-04-01
.
But this seems to be okay because we care more about when the virus is first uploaded. Both the sequence and virus will have inclusion_date
field though.
Hmm.... I see. Thinking more, what if we had virus_inclusion_date
and sequence_inclusion_date
fields. The merge could include one or both in the resulting FASTA. Seems a bit cleaner perhaps. What do you think?
I like that! They'd both be left after the merge and can be downloaded to the resulting fasta if needed.
Exactly. I like it.
I believe this works now. I also added the fields to current documents in vdb
and tdb
defaulted to 2016-09-03
. Also reminder that the inclusion_date
and timestamp
fields are based off utc time.
Fantastic! Thanks so much for making this happen @chacalle.
I've realized it would be totally helpful to have a
inclusion_date
field in addition to thecollection_date
field in vdb. As a simple use case, I just uploaded new Zika sequences from ViPR and in doing so all the timestamps we set to present and it wasn't at all easy to find what was new. It would have been super handy to be able to sort byinclusion_date
. This would also allow us to roll about the visualization and see what data was available when.My proposal would be just to use whatever date it is when a document is first added to the database. This should be for appearance in our database, not when the sequence first appeared in GenBank.
An eventual goal is to be able to have a nextstrain.org/updates/ page that would list atomic updates to the app and link to new viruses added in each update. This could pretty easily be done by passing the
inclusion_date
to augur/auspice from vdb.