Use seqkit grep instead of faSomeRecords for accuracy

AlbertRockG commented 1 year ago

Description of proposed changes

I am opening this pull request to suggest the use of seqkit grep instead of faSomeRecords to extract Africans fasta records from the sequences_fasta.tar.xz file downloaded from GISAID.

Related issue(s)

I am making this suggestion because I noticed recently that the number of fasta records in the sequences_africa.fasta file obtained exceeds the number of entries in the strains_africa.txt file.

Testing

[ ] Checks pass

tsibley commented 1 year ago

With a little testing, there's a semantic change between faSomeRecords and seqkit grep -n. The first matches only on id (e.g. a FASTA definition line is >id description more description); the second matches on the id + description.

AlbertRockG commented 1 year ago

@tsibley thank you for your review. When I noticed that the number of records exceed the number of entries in strains_africa.txt file, I check the content of the sequences_africa.fasta file and found sequences from South Korea. Kindly find below screenshots of my terminal: Screenshot from 2023-03-29 00-56-22 From that observation, I suspect that faSomeRecords would have difficulty processing the id of sequences from South Africa, confusing them with sequences from South Korea. I tried seqkit grep -n and found it more efficient.

AlbertRockG commented 1 year ago

With a little testing, there's a semantic change between faSomeRecords and seqkit grep -n. The first matches only on id (e.g. a FASTA definition line is >id description more description); the second matches on the id + description.

As the ids are formatted by the provided script, I don't think this change in semantics matters. My idea is that faSomeRecords might have trouble processing ids that contain spaces. Thanks.

tsibley commented 1 year ago

Ah, that extra info is very helpful. I agree we do want to switch to seqkit grep -n.

The bug you're seeing is indeed due to the difference in FASTA between a sequence's id and it's full name/description. The id is the part between the > and the first space, e.g. in >x y z the id is x and the full name/description is x y z.

faSomeRecords only matches on the id, which is why filtering for hCov-19/South Africa and hCov-19/South Korea will yield the same results: it only matches on everything before the first space, or hCov-19/South. This causes the bug, because this workflow's data/strains_africa.txt file contains the full name/description for each sequence (as id isn't unique in this data).

seqkit grep also defaults to matching on the id just like faSomeRecords, but the -n (or --by-name) option changes it to match on the full name/description. As you suggested with your bug fix, that's indeed what we want.

Thanks for this! I'll cherry-pick your commit and merge.

tsibley commented 1 year ago

@AlbertRockG I've cherry-picked your fix and amended it a bit (for some further adjustments to the README and the commit message) to make https://github.com/nextstrain/ncov-africa-cdc/pull/8. That's merged now. Thanks again for this bug report and fix!

nextstrain / ncov-africa-cdc