nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License
36 stars 20 forks source link

Slightly different strain names in sequences.fasta vs metadata #103

Closed AngieHinrichs closed 3 years ago

AngieHinrichs commented 4 years ago

Current Behavior
While the vast majority of today's 150k strain names match between sequences.fasta and metadata downloaded from GISAID, there are 16 sequences with slightly different names in sequences.fasta compared to metadata.

Expected behavior
Ideally 100% of the names could match, so that the metadata for every sequence in sequences.fasta could be automatically associated with the sequence.

How to reproduce

  1. Download nextmeta and nextfasta from GISAID (currently metadata_2020-10-20_07-16.tsv.gz and sequences_2020-10-20_07-17.fasta.gz)

  2. Extract and compare strain names in a command shell:

gunzip -c metadata_2020-10-20_07-16.tsv.gz | tail -n +2 | cut -f 1 | sort > metadata.names
gunzip -c sequences_2020-10-20_07-17.fasta.gz | grep ^\> | sed -e 's/^>//' | sort > sequences.names
comm -3 metadata.names sequences.names
  1. Differences (first column is name that appears in metadata_2020-10-20_07-16.tsv.gz, second column is name that appears in sequences_2020-10-20_07-17.fasta.gz:
    /Italy/SIC-IZSBM169-57/2020
    /Italy/SIC-IZSBM169-76/2020
    Bangladesh/BCSIR-NILMRC-125/2020
    Bangladesh/BCSIR-NILMRC-131/2020
    Bangladesh/BCSIR-NILMRC-139/2020
    Bangladesh/BCSIR-NILMRC-145/2020
    Bangladesh/BCSIR-NILMRC-146/2020
    Bangladesh/BCSIR-NILMRC-150/2020
    Bangladesh/BCSIR-NILMRC-151/2020
    Bangladesh/BCSIR-NILMRC-152/2020
    Bangladesh/BCSIR-NILMRC-153/2020
Bangladesh/BCSIR-NILMRC_125/2020
Bangladesh/BCSIR-NILMRC_131/2020
Bangladesh/BCSIR-NILMRC_139/2020
Bangladesh/BCSIR-NILMRC_145/2020
Bangladesh/BCSIR-NILMRC_146/2020
Bangladesh/BCSIR-NILMRC_150/2020
Bangladesh/BCSIR-NILMRC_151/2020
Bangladesh/BCSIR-NILMRC_152/2020
Bangladesh/BCSIR-NILMRC_153/2020
India/HR-THSTI-42/2020
    India/HR-THSTI-BAL-42/2020
Italy/SIC-IZSBM169-57/2020
Italy/SIC-IZSBM169-76/2020
Spain/CT-015D0/2020
Spain/CT-2020030095/2020
    Spain/CT-IrsiCaixa-015D0/2020
    Spain/CT-IrsiCaixa-2020030095/2020
Switzerland/BS-42169171/2020
    Switzerland/BS-UHB-42169171/2020
env/Italy/LAZ-INMI-HSacco-1/2020
    env/Italy/LOM-INMI-HSacco-1/2020

Perhaps some name cleanups are applied to the metadata strain names that could also be applied to the sequence strain names?

Thank you so much for providing nextmeta and nextfasta, they're indispensable!

AngieHinrichs commented 3 years ago

The two sequences with "/Italy" in the fasta headers have been fixed, but there are still slight differences between nextmeta strain and fasta header for the other 14 in nextmeta and nextfasta downloaded today. Metadata strain column on the left, fasta header name on the right:

sdiff <(comm -23 metadata.names sequences.names) <(comm -13 metadata.names sequences.names)
Bangladesh/BCSIR-NILMRC_125/2020                              | Bangladesh/BCSIR-NILMRC-125/2020
Bangladesh/BCSIR-NILMRC_131/2020                              | Bangladesh/BCSIR-NILMRC-131/2020
Bangladesh/BCSIR-NILMRC_139/2020                              | Bangladesh/BCSIR-NILMRC-139/2020
Bangladesh/BCSIR-NILMRC_145/2020                              | Bangladesh/BCSIR-NILMRC-145/2020
Bangladesh/BCSIR-NILMRC_146/2020                              | Bangladesh/BCSIR-NILMRC-146/2020
Bangladesh/BCSIR-NILMRC_150/2020                              | Bangladesh/BCSIR-NILMRC-150/2020
Bangladesh/BCSIR-NILMRC_151/2020                              | Bangladesh/BCSIR-NILMRC-151/2020
Bangladesh/BCSIR-NILMRC_152/2020                              | Bangladesh/BCSIR-NILMRC-152/2020
Bangladesh/BCSIR-NILMRC_153/2020                              | Bangladesh/BCSIR-NILMRC-153/2020
India/HR-THSTI-42/2020                                        | India/HR-THSTI-BAL-42/2020
Spain/CT-015D0/2020                                           | Spain/CT-IrsiCaixa-015D0/2020
Spain/CT-2020030095/2020                                      | Spain/CT-IrsiCaixa-2020030095/2020
Switzerland/BS-42169171/2020                                  | Switzerland/BS-UHB-42169171/2020
env/Italy/LAZ-INMI-HSacco-1/2020                              | env/Italy/LOM-INMI-HSacco-1/2020

I don't have access to GISAID's JSON data so I can't just run locally to debug. I will ask for access.

emmahodcroft commented 3 years ago

Thanks Angie! I've tracked this down - they're all strain names we're modifying manually in ncov-ingest - but I'm not sure why the fasta files aren't modified as well... that shouldn't be the case.

For some of these I can tell why we started modifying them, as originally the strain name included characters it shouldn't (like >) - but for others I'm less clear. We should be able to solve this by just removing all these manual changes - but I'm checking with the rest of the team first to ensure I know why we did this originally, to ensure we won't break anything by undoing it!

Thanks for flagging though, this is definitely something we need to resolve!

AngieHinrichs commented 3 years ago

Thank you for looking into this, Emma! Omg I was grepping for them in the code and now I see they've been in gisaid_annotations.tsv all along. <:)

From transform-gisaid line 182 it looks like if --sorted-fasta is used then the metadata and fasta cannot disagree because they use the same entry['strain']. However, a different (shorter) pipeline is used for fasta if --sorted-fasta is not used. That pipeline does not include MergeUserAnnotatedMetadata(annotations), so entry['strain'] is never modified for the fasta output.

I see the comment that --sorted-fasta requires a lot more memory, makes sense. If adding MergeUserAnnotatedMetadata(annotations) to the unsorted fasta pipeline would slow it down noticeably, then separating out only the strain name changes (with something like pipe from awk '$3 == "strain"', or perhaps making a separate file for them, would probably make it pretty lightweight -- there are only 87 of those lines in gisaid_annotations.tsv currently, and since only 14 strain values are mismatching between metadata and sequences, I bet the other 73 have already been fixed upstream. :)

eharkins commented 3 years ago

@AngieHinrichs this should be fixed in #110. Let us know if it continues to be an issue. Thanks for flagging and diagnosing!

AngieHinrichs commented 3 years ago

Excellent, thank you for fixing it!