ncbi / sra-human-scrubber

An SRA tool that takes as input local fastq file from a clinical infection sample, identifies and removes any significant human read, and outputs the edited (cleaned) fastq file that can safely be used for SRA submission.

human kmer present in Escherichia coli reference genome #27

Closed mikelchtermans closed 6 months ago

mikelchtermans commented 7 months ago

Hi,

I used the compiled aligns_to tool as a standalone to scrub the Escherichia coli reference genome (https://www.ncbi.nlm.nih.gov/nuccore/NC_002695.2/) and found, using the -print_kmers_only flag, one supposedly human kmer in the reference genome: CACCACCATTACCACCACCATCACCACCACCA. This kmer is also present in several alleles of a gene in Enterobase's E. coli cgMLST scheme, specifically in locus b0001 (https://enterobase.warwick.ac.uk/schemes/Escherichia.cgMLSTv1/b0001.fasta.gz), e.g. allele 2. This would lead to comparability issues if I decided to scrub raw reads before running a cgMLST analysis.
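To double-check a hit like this independently of the scrubber, the k-mer can be searched for directly in the genome sequence. The following is only a minimal sketch: `reverse_complement` and `find_kmer` are hypothetical helper names, the sequence is assumed to be already loaded as a plain string, and searching both strands is an assumption here, not a statement about how aligns_to matches k-mers.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_kmer(sequence: str, kmer: str) -> list[tuple[int, str]]:
    """Return (0-based start, strand) for every exact occurrence of kmer.

    The forward k-mer is reported as '+', its reverse complement as '-'.
    A palindromic k-mer is searched only once, as '+'.
    """
    sequence = sequence.upper()
    kmer = kmer.upper()
    queries = {kmer: "+"}
    queries.setdefault(reverse_complement(kmer), "-")

    hits = []
    for query, strand in queries.items():
        start = sequence.find(query)
        while start != -1:
            hits.append((start, strand))
            # Step by one so overlapping occurrences are also found.
            start = sequence.find(query, start + 1)
    return sorted(hits)


# The k-mer reported above; the surrounding sequence is made up for illustration.
KMER = "CACCACCATTACCACCACCATCACCACCACCA"
toy_genome = "GGG" + KMER + "TTT" + reverse_complement(KMER)
print(find_kmer(toy_genome, KMER))
```

The same function can be run against each allele of a cgMLST locus to see which alleles would be affected by scrubbing.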

I wonder how this kmer from a non-eukaryotic RefSeq genome can be present in the human database, because according to the documentation on how the human kmer database is built, it should have been subtracted:

Briefly, the HRRT employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records and subtracts any k-mers found in non-Eukaryota RefSeq records. The remaining set of k-mers compose the database used to identify human reads by the removal tool.
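The subtraction described in that passage amounts to a set difference, which is why a k-mer shared with a bacterial RefSeq genome would not be expected to survive. Here is a toy sketch of the idea only, not the actual HRRT build process: `kmers` and `build_filter_db` are hypothetical names, k = 32 is assumed from the length of the k-mers reported in this thread, and reverse complements and canonical k-mers are ignored.

```python
from collections.abc import Iterable


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_filter_db(
    human_seqs: Iterable[str],
    non_eukaryote_seqs: Iterable[str],
    k: int = 32,  # assumption: matches the k-mer length seen in this thread
) -> set[str]:
    """Collect human k-mers, then subtract any k-mer seen in non-Eukaryota."""
    human: set[str] = set()
    for seq in human_seqs:
        human |= kmers(seq, k)

    non_euk: set[str] = set()
    for seq in non_eukaryote_seqs:
        non_euk |= kmers(seq, k)

    # Set difference: only k-mers exclusive to the human set remain.
    return human - non_euk


# Toy data with k = 4 so the effect is easy to follow by eye.
db = build_filter_db(["ACGTACGA"], ["TTACGTTT"], k=4)
print(sorted(db))
```

In this toy run the k-mers shared by both inputs (ACGT and TACG) are dropped, which is the behaviour the documentation describes.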

Kind regards, Michaël

multikengineer commented 6 months ago

Michaël,

Thank you for your diligent work! The kmer you identified is also found (as an artifact from a BAC, I imagine) in two different human reference genome chromosome sequences. I am surprised this kmer survived the final merging process, and I will take a closer look to make sure there isn't a problem. Meanwhile, I can remove that kmer from the human_filter db, and then you could update your db. Does that seem reasonable to you?

Respectfully, Ken

mikelchtermans commented 6 months ago

Hi Ken, thank you for the quick response! Your proposed actions do seem reasonable to me :) In the meantime I have also found 2 human kmers in the Neisseria meningitidis reference genome (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_008330805.1/): TAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGC and TTGCTTGCTTGCTTGCTTGCTTGCTTGCTTTC. As far as I can see, these kmers are not present in its cgMLST scheme.

On the bright side, I also checked a few other reference genomes of bacteria that cause common diseases and did not find any human kmers in them.

Kind regards, Michaël

multikengineer commented 6 months ago

Michaël,

Again, thank you for such diligent work! I have updated the db to the new version, 20231218v2.

mikelchtermans commented 6 months ago

Works like a charm now, thank you!