nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue
8 stars 10 forks source link

Removal of genome containing plasmid sequence #28

Closed Rohit-Satyam closed 5 months ago

Rohit-Satyam commented 7 months ago

Dear @j23414

I see that some of the sequences in all_sequences.fasta file have plasmid DNA as well and therefore are circular DNA and have length greater than 12000 bp. These include

AY243466
AY243467
AY243468
AY243469
AY376438
AY648301
AY656167
AY656168
AY656169
AY656170
AY744148

Shouldn't these be either removed or the plasmid sequence be chopped off? I also see some extremely small genomes of length 2K. Too lar or too small genomes can influence the MSA so shouldn't they be removed from the resource? If so what's thresholds would you recommend for filtering the uninformative genomes. Given the graph below, I was thinking to take Upper boundary (i.e. 12184bp) and lower boundary (i.e. 8670bp)

newplot (3)

Rohit-Satyam commented 7 months ago

Similarly, there are genomes with ~61% of the genome as Ns. GISAID uses at-least 50% genome assembly completeness as mandatory to assign lineage to an assembly. If we use a similar analogy these genomes might also be removed. All these 13 genomes come from the study https://pubmed.ncbi.nlm.nih.gov/19965886/. The submitters somehow managed to meet the length requirement and reported these genomes as a complete assembly. What do you think?

These 13 genomes have <50% of genome assembly covered and come from this 2010 study.
GU318306
GU318307
GU318308
GU318309
GU318310
GU318311
GU318312
GU318313
GU318314
GU318315
GU318316
GU318317
GU318318
j23414 commented 7 months ago

Hello @Rohit-Satyam, thanks for flagging those sequences for removal!

We tend to pull the entire NCBI dengue dataset into the sequences FASTA file and then subsequently filter out any problematic ones during the phylogenetic build. I've put the flagged circular strains in the phylogenetic/config/exclude.txt file in this PR: https://github.com/nextstrain/dengue/pull/29

Thanks for the violin plot! The upper and lower bounds should be adding in an augur filter call. Currently we're filtering out strains that are shorter than 5000nt, defined in a config file but your assessment of a lower boundary (i.e. 8670bp) makes sense to me...just need to glance through the tree to see if there are potential issues.

We don't seem to have a --max-length flag, but we should be able to check for sequences that are too long or too distant from the reference sequence, to be added to the "exclude.txt" file.

I'm still looking into genomes with ~61% of the genome as Ns you've flagged. I'll go through the links and background and check if they're making it through our current filters or add them to the exclude list.

Rohit-Satyam commented 7 months ago

This is not a query. Just some exploratory analysis and mental notes on why not to rely on the genome completeness filter of the NCBI Virus database. Besides ViruSurf uses outdated genome assemblies so avoid using their resource for time being.

I observed the most exhaustive set of DENV genomes has been provided by Nextstrain. On the contrary, the NCBI Virus database only provides merely 4K complete genomes when using their "complete" genome filter). To investigate reason behind this I tried to intersect the Accession numbers obtained from metadata from NCBi and Nextstrain and the results are shown below.

jVenn_chart

We can see that there are 9745 private to Nextstrain. But now I know that this nuisance arose because the NCBI virus tagged many complete assemblies as incomplete and many (nearly 85) incomplete genomes of merely 1.4K length as complete. I have requested them to correct this. Only MN192436 is missing from Nextstrsin.

@j23414 There are duplicates in Nextstrain i.e. out of 14522, 13736 are unique. Deduplication was done using vsearch. Further, these duplicates arise from some old accession numbers associated with new accession numbers. Eg:

vsearch --derep_fulllength all.fasta --output drep.fasta
DEN2CGA
DEN2JAMCG

image

Rohit-Satyam commented 7 months ago

@j23414 Besides do you guys have any plans for developing a similar resource for FMD virus?

j23414 commented 7 months ago

Hi @Rohit-Satyam, no plans for a FMD resource yet.

Might be able to use the pathogen-repo-guide but nothing planned yet

j23414 commented 7 months ago

Thanks for flagging the 13 strains with many Ns! It gave me incentive to look into the augur min-length filter (to check if N's are counted when calculating sequence length).

For the test, I pulled the flagged strains and metadata into NN.fasta and NN.tsv files respectively. Looks like it disregards N's when calculating length, so these will be filtered out by the --min-length 5000 parameter, but thanks for checking!

augur filter \
  --min-length 5000 \
  --sequences NN.fasta \
  --metadata NN.tsv \
  --metadata-id-columns accession \
  --output-strain passed.txt

#> 13 strains were dropped during filtering
#>   13 of these were dropped because they were shorter than minimum length of 5000bp
ERROR: All samples have been dropped! Check filter rules and metadata file format.
j23414 commented 7 months ago

I'm working through the duplicates. I played around with vsearch although I ended up using smof uniq to aggregate duplicate names into the fasta header:

conda install bioconda::smof
smof uniq \
  --pack data/sequences_all.fasta \
  --pack-sep ';' \
  > packed_uniq.fasta

grep ">" packed_uniq.fasta \
| grep ";" \
> duplicates.txt

head duplicates.txt
#> >ON123563;ON123566
#> >ON123564;ON123565;ON123567
#> >ON123568;ON123569;ON123570
#> >LQ868394;OG164735;LQ729835;JC562943;LQ868377;OG164704;AY099336;NC_001475;ON103302;ON103390;ON111445;ON115021;ON115023;ON115025;ON115029;ON115182;ON115811;ON115814;ON115816;ON115818;ON116129;ON127418;ON127533;ON127537;ON127550;ON127552
#> >LY683285;HQ026763;

I think a bunch of these will disappear if we filter to only "VRL" genbank files, trying to track down if we can get that field in ncbi datasets, but I'll draft out a manual exclude until then.

LOCUS       AF326573               10649 bp    RNA     linear   VRL 12-FEB-2004
LOCUS       LP918029               10649 bp    DNA     linear   PAT 29-APR-2018

Only MN192436 is missing from Nextstrain.

Will look into why this is missing

Rohit-Satyam commented 7 months ago

@j23414 nice work. I appreciate your efforts. I would also suggest you uniformize the subclade for DENV serotype 2 and serotype 4. Subclades for serotype 1 and 3 are Roman numerals ("DENV1/V" and "DENV3/III" ) while subclades for serotype 2 and 4 are characters like "S" or "AA" ("DENV4/S" and "DENV2/AA" ).

unique(nextstrain$subclade)
 [1] ""          "DENV1/V"   "DENV3/III" "DENV2/C"   "DENV1/III" "DENV2/AA" 
 [7] "DENV4/II"  "DENV2/AI"  "DENV1/I"   "DENV4/I"   "DENV3/I"   "DENV1/IV" 
[13] "DENV3/IV"  "DENV3/II"  "DENV2/AM"  "DENV2/S"   "DENV1/II"  "DENV4/S"  

If you could change the following:

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II
DENV2/S --> DENV2/VI 
DENV4/S --> DENV4/IV

The roman numerals represents the genotype and I got this information from the following:

DENV-2 serotype is classified into genotypes I (American), II (Cosmopolitan), III (Asian/American), IV (Asian II), V (Asian I) and VI (Sylvatic). Source:

Genotype IV is sylvatic, with only three known sequences. Source 1 and we know P73_1120 is sylvatic strain

Rohit-Satyam commented 7 months ago

@j23414 I would like to bring your attention to the following subclade assignments

The subclade of the genome GQ398268.1 is designated as DENV2/AI in nextclade annotation which means this is Asian I (i.e genotype V). However, according to a supplementary material here says it's Asian II (i.e genotype IV).

GQ398268 Dengue virus 2 strain DENV-2/ID/1022DN/1975, complete genome Homo sapiens Indonesia 1975 Asian II

Similarly OK573328 is assigned to DENV2/AI while the strain field says it's Type_2/New_Guinea_C_NGC_ and a study here says and I quote

Similarly, the NIH DENV2 component of the tetravalent TV003 vaccine, based on the New Guinea C 1980 strain from the Asian II genotype...

To confirm if it is Asian II, I carried out blast of with OK573328

KM204118 Dengue virus 2 strain New Guinea C, complete genome Homo sapiens Papua New Guinea 1944 Asian II

Which came out as follows with 99% identity and query cover:

image

Rohit-Satyam commented 7 months ago

@j23414 and I think there are more. Below is a list of 16 genomes that needs correction at subclad level based on the table obtained from supplementary material link highlighted above. Please find the attachment below

temp_check.csv

Can you tell me how these subclad assignments were achieved by you?

j23414 commented 7 months ago

Hi @Rohit-Satyam , sorry for the late answer!

The subclade assignments were outdated (from before we had DENV2/AII (aka DENV2/IV) classification). I've updated the nextclade_subtype (last column) in the following file (notice the extra "workflows/" in the URL).

You can also update the URL's in the phylogenetic/rules/prepare_sequences.smk file, and it should automatically pull the correct ones.

https://github.com/nextstrain/dengue/blob/5385608d8ee66f83c01996321b977cd293603432/phylogenetic/rules/prepare_sequences.smk#L24-L25

On our end, we're discussion how to smoothly transition to the the DENV2/IV style, probably add it as a separate field for a while then deprecate the old field.

Rohit-Satyam commented 7 months ago

Hi, @j23414 thanks for working on this promptly and correcting the subclade assignment of the aforementioned genomes. I was wondering if we can carry out BlastN of sequences with unknown serotypes with the ones that have known serotypes and then use their metadata to assign serotypes/genotype to the unknown ones using some stringent threshold such as 90% query cover and identity or use something like MeshClustV3? Besides when I see your exclude.txt file, I see some sequences being "too divergent" how do you decide those sequences? Did you calculate msa entropy or something?

Rohit-Satyam commented 7 months ago

Hi @j23414. I downloaded new metadata and tried to annotate some more genomes at least at the serotype level and some at the genotype level. Below attached is the Excel sheet. Some are literature-based (dug some supplementary material files) and another sheet is based on serotype information based on fields like serotype and description present in the GenBank file using code below. Now I know that it's not a perfect code but it does the job and helped me manually annotate several genomes quickly. Please have a look. I've included below the code I used but I think you can optimize it. If you do let me know

while read p; 
do 
efetch -db nuccore -id "$p" -format FEATURES > file.temp; 
note=$(grep note file.temp | awk -F'=' '{print $2}' | tr '\n' ':' | sed 's/ *\/note="//'); 
strain=$(grep "strain=" file.temp | tr '\n' ' ' | sed 's/ *\/strain="//'); 
org=$(grep organism file.temp | sed 's/ *\/organism="//'); 
isolate=$(grep "isolate=" file.temp | tr '\n' ' ' | sed 's/ *\/isolate="//'); 
def=$(grep DEFINITION file.temp); 
sero=$(grep serotype= file.temp | sed 's/ *\/serotype="//'); 
echo -e "$p;$strain;$note;$org;$isolate;$def;$sero" >> results.tsv; 
done < temp

## where temp is file containing the accessions of genomes that do not have serotype or genotype information

serotype.xlsx

Rohit-Satyam commented 7 months ago

@j23414 did you check for the genomes above?

j23414 commented 7 months ago

Hi @Rohit-Satyam! Sorry, bogged down with prepping for a presentation next week. I plan to come back to this soon.

From a glance through, they seem reasonable to add to "annotations.tsv", basically figuring out how to mark the reasoning for each manual change (e.g. from genbank, from linked publication, from nearest blast hit).

j23414 commented 5 months ago

Closing this since merged

The genotype renaming is being discussed on a separate issue

Of course, feel free to submit new issues to flag any problematic records