steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

List of accession numbers for contamination in NR #2

Open pmenzel opened 4 years ago

pmenzel commented 4 years ago

Thanks for addressing the contamination issue (again)!

I am interested in cleaning up the NR database before using it for my application.
Would it be possible to also make available the list of the problematic accession numbers that you found in your contamination screen of the NR database?

thanks! Peter

martin-steinegger commented 4 years ago

Thank you @pmenzel I will upload the results from the NR to the FTP tomorrow.

martin-steinegger commented 4 years ago

Sorry for the delay. I have added the NR files to the ftp ftp://ftp.ccb.jhu.edu/pub/data/conterminator

There are two files: (1) nr.ids.gz, which only contains the identifiers, and (2) nr.gz, which shows where each protein originates from.

pmenzel commented 4 years ago

Great thanks! NB, the nr.ids.gz file has 14149 entries, while the manuscript mentions 14132 predicted contaminant proteins.

martin-steinegger commented 4 years ago

Thank you for catching this! The reported number in the paper is from the Kraken report. I lost some entries while converting the Conterminator result to a Kraken output using krakenuniq-report (report: nr.krakenreport.zip). This is due to an inconsistency between the NCBI taxonomy dumps used by Conterminator and the KrakenUniq database. I am sorry for the confusion.

You have to be a bit careful with the protein predictions. In the protein case it is harder for Conterminator to predict the direction of the contamination correctly, because it can only use abundance information (for nucleotide databases it uses the contig length). But there is a way to increase precision by trading sensitivity: select only 1-to-N kingdom mappings. The contaminant is then most likely the protein that occurs only once. You can select this case with the following awk command.

awk '(($2==1)+($3==1)+($4==1)+($5==1)+($6==1))==1 && (($2>1)+($3>1)+($4>1)+($5>1)+($6>1))==1 '  <(zcat nr.gz)
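A commented version of the same filter may make the logic easier to follow. It assumes (as the one-liner above implies) that fields 2 through 6 of nr.gz hold per-kingdom occurrence counts; the sample file here is synthetic, purely for illustration.

```shell
# Synthetic stand-in for nr.gz (real file has more columns/rows):
#   field 1 = protein id, fields 2-6 = counts in five kingdoms
printf 'P1\t1\t5\t0\t0\t0\nP2\t1\t1\t0\t0\t0\nP3\t2\t3\t0\t0\t0\n' | gzip > toy_nr.gz

# Keep rows where exactly one kingdom count equals 1 (the likely contaminant)
# AND exactly one kingdom count exceeds 1 (the likely true source):
awk '(($2==1)+($3==1)+($4==1)+($5==1)+($6==1))==1 &&
     (($2>1)+($3>1)+($4>1)+($5>1)+($6>1))==1' <(zcat toy_nr.gz) > one_to_n.tsv

cat one_to_n.tsv   # only P1 passes: one count of 1, one count > 1
```

P2 (two counts of 1) and P3 (no count of 1) are filtered out, since in those cases the direction of contamination is ambiguous.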
martin-steinegger commented 4 years ago

I will also upload a list of protein IDs encoded on the short contaminated nucleotide contigs. I assume it might be useful for you to also remove these.

smsaladi commented 4 years ago

Thanks for this! Do you happen to have the UniProt identifiers handy (the 7359 entries mentioned in the preprint)? This would be useful to me.

Also, as an aside, have you considered hosting these accession numbers and maybe a little metadata in a simple html table (say with jquery-datatables loaded)? Maybe with gh-pages. It would make the data a bit more discoverable -- and also accessible to non-computational folks.

XiongGZ commented 4 years ago

When I use it, it just stops at "Download taxdump.tar.gz". I have downloaded taxdump.tar.gz from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, but I don't know where to put it. Can you tell me what I should do, please?

martin-steinegger commented 4 years ago

@xgz-98 Conterminator automatically downloads the taxdump from the NCBI site. You only need to provide a fasta file and the respective mapping from identifier to taxid.
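For reference, a minimal sketch of the expected inputs, based on the bundled example files (example/dna.fna and example/dna.mapping); the mapping format is assumed here to be two tab-separated columns, sequence identifier then NCBI taxid, and the identifiers/taxids below are made up for illustration:

```shell
# Hypothetical FASTA input:
printf '>seq1\nACGTACGT\n>seq2\nTTGGCCAA\n' > my.fasta

# Assumed mapping format: <identifier> <TAB> <NCBI taxid>
printf 'seq1\t9606\nseq2\t562\n' > my.mapping

# Invocation as in the example from this thread:
# conterminator dna my.fasta my.mapping result tmp
```

The taxdump itself is fetched automatically during the createtaxdb step, so only these two files need to be supplied.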

XiongGZ commented 4 years ago

When I use the command "conterminator dna example/dna.fna example/dna.mapping ${RESULT_PREFIX} tmp", it always stops at the createtaxdb step.

Download taxdump.tar.gz
tar: Skipping to next header
2020-04-18 14:36:42 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [51859296/51859296] -> "-" [7]

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Error: createtaxdb step died

I don't know how to solve it.
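The gzip CRC/length errors in a log like this typically mean the downloaded archive was truncated or corrupted in transit. A quick local integrity check (demonstrated on a synthetic archive; substitute the real taxdump.tar.gz path) is:

```shell
# Build a stand-in archive so the check is self-contained:
printf 'names.dmp contents\n' > names.dmp
tar -czf taxdump.tar.gz names.dmp

# gzip -t tests integrity without extracting:
gzip -t taxdump.tar.gz && echo "archive OK"

# A truncated copy reproduces the reported CRC/length errors:
head -c 20 taxdump.tar.gz > broken.tar.gz
gzip -t broken.tar.gz 2>/dev/null || echo "archive corrupted: re-download it"
```

If the check fails on the real file, re-downloading it is usually the fix.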

martin-steinegger commented 4 years ago

@xgz-98 I have opened a separate issue for this: https://github.com/martin-steinegger/conterminator/issues/5