Closed xz-keg closed 10 months ago
It seems that most of the seqs of this branch are dropped by UShER for unknown reasons.
They have fewer than 5 reversions (mostly the S:157/158 artefact). @AngieHinrichs
It may simply be due to the delay in our downloading sequences from GISAID (in chunks of 5,000 at a time; this is a nightly chore for a junior member of our lab and sometimes they miss a day) and adding them to the tree. Some of the missing sequences are EPI_ISL_1863xxxx which we don't have yet. I would have expected the EPI_ISL_1861xxxx sequences like EPI_ISL_18617458 (Denmark/DCGC-665632/2023) to have been added before today, but I see we did not have EPI_ISL_18617458 in yesterday's data. We have EPI_ISL_18617458 today, and it is being added to the 2023-12-15 tree. The EPI_ISL_1863xxxx sequences should be added to the tree within a couple days.
Thank you @AngieHinrichs! @aviczhl2 if you use the "UShER test" server, you sometimes get the new update a bit earlier, by just a few hours: https://genome-test.gi.ucsc.edu/cgi-bin/hgPhyloPlace
Thanks! I was just wondering why some seqs submitted in 2023-11 (those EPI_1861xx) were not on the tree.
GISAID doesn't document this, but I think it's possible that their curation process sometimes delays sequences past their submission date. I believe that their curation process includes some automated checks for things like frameshifts that result in some sequences being bounced back to submitters for manual checking and confirmation. So submission date is not the same as release date. As far as I can tell, there is no separate record of release date. I can only tell when a sequence was first present in our manual downloads.
Thanks! I realize this too. But I always have trouble picking out the newly released seqs that have an older submission date.
I'd like to learn your way to get newly-released sequences.
big jump from last upload by Netherlands
Since I do not have access to download the full GISAID dataset, my method is to maintain a local cache of sequences, updated by downloading up to 5,000 sequences at a time. That is a lot of tedious work for a junior member of my lab, and it leaves our local data outdated because there are only so many chunks of 5,000 sequences that one can be expected to download every night. I do not recommend it if there is a better way, but it is what we have to do in order to keep the UShER tree as up to date as possible.
Nightly downloads from INSDC, COG-UK and CNCB are freely available and completely automated on my server. I can share scripts for those if you're interested.
But I assume you're asking about GISAID in particular. If you are among the lucky few who can see an icon labeled something like "nextmeta" in your Downloads tab then that is best! Save a copy every day, extract the EPI_ISL ID column, and compare the previous day's EPI_ISL set to today's set. But most registered users are not so lucky. There is a different file in the Downloads tab that may actually be useful although it does not seem to be updated every day (perhaps weekly?). Here is the first section that appears in my Downloads tab (it's not the same for every user):
I downloaded the icon labeled "spikeprot1213" which saved the file hCoV-19_spikeprot1213.tar.xz. Then I extracted its contents:
spikeprot1213/FASTA_header_format_for_allprot_spikeprot.txt
spikeprot1213/METHOD_for_generating_allprot_spikeprot.txt
spikeprot1213/readme.txt
spikeprot1213/spikeprot1213.fasta
The file spikeprot1213/spikeprot1213.fasta has headers with a format described in the FASTA_header*.txt file. Those include the accession. You can get a sorted list of accessions like this:
grep '^>' spikeprot1213/spikeprot1213.fasta | sed -e 's/.*|EPI_ISL_/EPI_ISL_/; s/|.*//' | sort > IDs.1213
Then, when there is a new file (say "spikeprot1220" which doesn't exist yet but maybe will?), you can run a similar command to create the corresponding file IDs.1220. Then you can find out what sequences are new with this command:
comm -13 IDs.1213 IDs.1220 > newIDs.txt
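For anyone who wants to try the steps above before a second spikeprot file exists, here is a minimal sketch that runs the same extract-and-compare workflow on two tiny made-up FASTA files. The file names old.fasta and new.fasta, the helper extract_ids, the accessions, and the exact header layout are all invented for illustration; the real layout is described in FASTA_header_format_for_allprot_spikeprot.txt. The pipe is left unescaped in the sed expressions because an unescaped | is a literal character in basic regular expressions, whereas GNU sed treats \| as alternation. Sorting and comparing under LC_ALL=C is my own precaution so that sort and comm agree on ordering.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-ins for two successive spikeprot FASTA files.
# The pipe-separated header layout and the accessions are made up.
cat > old.fasta <<'EOF'
>Spike|hCoV-19/Example/A/2023|2023-11-01|EPI_ISL_100|Original|Human
MFVFLVLLPL
>Spike|hCoV-19/Example/B/2023|2023-11-02|EPI_ISL_200|Original|Human
MFVFLVLLPL
EOF

cat > new.fasta <<'EOF'
>Spike|hCoV-19/Example/A/2023|2023-11-01|EPI_ISL_100|Original|Human
MFVFLVLLPL
>Spike|hCoV-19/Example/B/2023|2023-11-02|EPI_ISL_200|Original|Human
MFVFLVLLPL
>Spike|hCoV-19/Example/C/2023|2023-11-05|EPI_ISL_300|Original|Human
MFVFLVLLPL
>Spike|hCoV-19/Example/D/2023|2023-11-06|EPI_ISL_400|Original|Human
MFVFLVLLPL
EOF

# Hypothetical helper: keep header lines, strip everything before the
# accession and everything after the next pipe, then sort for comm.
extract_ids() {
  grep '^>' "$1" | sed -e 's/.*|EPI_ISL_/EPI_ISL_/; s/|.*//' | LC_ALL=C sort -u
}

extract_ids old.fasta > IDs.old
extract_ids new.fasta > IDs.new

# comm -13 drops lines unique to the first file and lines common to both,
# leaving only the accessions that appear solely in the newer file.
LC_ALL=C comm -13 IDs.old IDs.new > newIDs.txt
cat newIDs.txt
```

With these toy inputs, newIDs.txt ends up holding EPI_ISL_300 and EPI_ISL_400, the two accessions present only in the newer file.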
38,Poland
please propose it
JN.2+T1030C+C11195T, C23277T, G28346T (Orf1a:L3644F, N:G25C, S:T572I)
GISAID query: C11195T, C23277T, G28346T
No. of seqs: 19 (Denmark 2, Belgium 1, Sweden 1, Germany 2, UK 1, Netherlands 12)
First: EPI_ISL_18576972, Netherlands, 2023-11-6
Latest: EPI_ISL_18613159, UK, 2023-12-5
usher