minoh0201 / MetaCompare

MetaCompare
MIT License
17 stars 5 forks source link

[file]_CARD.txt is empty output #2

Closed mhyleung closed 4 years ago

mhyleung commented 6 years ago

Dear all

I have been playing with MetaCompare with some of my coassembly contigs fasta files and corresponding prodigal output files. The MetaCompare output consistently tells me that the [sample]_CARD.txt is empty, and would not give me a risk score.

I presume MetaCompare thinks that my contigs do not have ARGs matching the CARD database, but I find it difficult to think that my contigs do not have any ARG sequences, given that the same input file gives me many hits on the ResFam database using HMMER. Therefore, could this be due to some other setting issues that does not allow MetaCompare to detect ARGs on my contigs files?

I have no problem running the example files, and my contigs file has hits to the ACARD and PATRIC databases.

Thank you

Marcus

minoh0201 commented 6 years ago

Hi Marcus,

MetaCompare checks the file size of [filename]_CARD.txt before reading it and when the file size is zero, MetaComapre will print an error "Warning: [filename]_CARD.txt is empty." So, I'm pretty sure your [filename]_CARD.txt contains nothing.

I suggest you try the followings:

1) If you are getting "BLAST Database error: No alias or index file found for protein database [./BlastDB/CARD_PROT] in search path [/user_path_to_save_metacompare/MetaCompare::]" before empty warning, you need to delete BlastDB folder and follow the step3 of "installing" in README.md.

2) If you are only getting "Warning: [filename]_CARD.txt is empty.", then remove "[filename]_CARD.txt" file and re-run MetaCompare to reproduce it.

3) if you are still getting the same error, 3-1) check [filename]_CARD.txt contains any information as like other annotation files (e.g. [filename]_ACLAME.txt and [filename]_PATRIC.txt). If the file is empty than it is not able to calculate a risk score. MetaCompare considers no ARG as no risk. 3-2) Make sure your sample has sufficient contigs derived from the assembly so that Blast annotation could capture ARG sequences. (Quick and easy solution is uploading your sample to MetaStorm and run assembly pipeline and check whether any function hits against CARD database is exist or not.) 3-3) Send me your sample then I can look at it. (Contact: minoh at vt dot edu)

Thanks for sharing your experience. Please do not hesitate to let me know further concerns.

Min

jllavin77 commented 4 years ago

Dear @minoh0201 ,

I have found a similar error to the one opening this post while Metacompare was running BLAST on the CARD database:

Warning: (1431.1) CFastaReader: Ignoring invalid residue 9 at line 219765453, position 16
Warning: (1431.1) CFastaReader: Ignoring invalid residue 4 at line 219765453, position 17
Warning: (1431.1) CFastaReader: Ignoring invalid residue _ at line 219765453, position 18
Warning: (1431.1) CFastaReader: Ignoring invalid residue 1 at line 219765453, position 19
Warning: (1431.1) CFastaReader: Ignoring invalid residue / at line 219765454, position 0
Warning: (1431.1) CFastaReader: Ignoring invalid residue / at line 219765454, position 1
Running blastn on PATRIC
Reading files..
Computing resistome risk score..
Warning: scaffold_CARD.txt is empty.

Here are the commands I used for the whole workflow:

java -jar /home/Analysis/Software/Trimmomatic-0.39/trimmomatic-0.39.jar PE ${infile} ${base}_2.fastq.gz \
                ${base}_1.trim.fastq.gz ${base}_1_un.trim.fastq.gz \
                ${base}_2.trim.fastq.gz ${base}_2_un.trim.fastq.gz \
                SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 
#Unzip fastqs to use them as fq2fa input
  gunzip -c ${base}_1.trim.fastq.gz > ${base}_1.trim.fastq 
  gunzip -c ${base}_2.trim.fastq.gz > ${base}_2.trim.fastq 
#Merge fastqs and conver into FASTA file for idba valid input
  fq2fa --merge --filter ${base}_1.trim.fastq ${base}_2.trim.fastq ${base}_mgd.fa
#Run idba for assembly  
idba_ud -r ${base}_mgd.fa --num_threads 8 -o ${base} rendimiento
#Run prodigal for gene detection  
prodigal -i ${base}_mgd.fa -o ${base}_mgd.genes -a ${base}_mgd.faa -p meta
#Run Metacompare 
# Inputs: contig-20.fa file resulting from idba and the .genes file obtained from prodigal
  python /home/Analysis/Software/MetaCompare/metacmp.py -c contig-20.fa -g ${base}_mgd.genes -t 48

Is there any problem with the CARD database you supply in your "BlastDB.tar.gz"? Could it be some kind of character encoding problem? I ask that because BLAST spits all those warnings (I showed befores), and the result is an empty file...

I used a workstation with CentOS 7, 192 Gb of RAM and 4 Xeon processors with 64 threads available.

Thanks in advance for your kind attention.

minoh0201 commented 4 years ago

Hi @jllavin77 ,

Those are not errors but warnings. Your Fasta file contains unexpected characters and those characters were ignored when BLASTP is running. That is what the message "Warning: (1431.1) CFastaReader: Ignoring invalid residue 9 at line 219765453, position 16" is saying.

Also, if there is no annotation derived from CARD DB, it is normal to have the warning message "Warning: scaffold_CARD.txt is empty", unless you see this warning with the given sample (S1.fa). In case of not having any ARG annotation, no risk score is expected.

jllavin77 commented 4 years ago

Hi @minoh0201,

I'll try to address your answers:

Those are not errors but warnings. Your Fasta file contains unexpected characters and those characters were ignored when BLASTP is running. That is what the message "Warning: (1431.1) CFastaReader: Ignoring invalid residue 9 at line 219765453, position 16" is saying.

I agree those are warnings, but those warnings end up in an error on the tool run. The sequences submitted to BLAST in this step are nucleotides (they come from the assembly of contigs) so we would expect to have a BLASTx running, not a BLASTp. Please, correct me if I'm wrong.

Also, if there is no annotation derived from CARD DB, it is normal to have the warning message "Warning: scaffold_CARD.txt is empty" unless you see this warning with the given sample (S1.fa). In case of not having any ARG annotation, no risk score is expected.

To test this hypothesis, I downloaded the CARD database (protein FASTA file) and generated a BLASTdb with it. Then I used my scaffolds file to perform a BLASTx search against it to see if I have any sequence matching the database. I do have positive BLAST results as I'll show later. The thing is that my BLAST has a custom output format that does not match your out format (at least that of the PATRIC db). So, I have resistance genes in my samples that your tool is unable to detect while BLASTing the CARD database (I'm not the only one affected by this issue). Could you please revise the BLAST call from this particular step in your code or tell me if I've been wrong in any step of my debugging.

Example of my CARD results:

BLASTX 2.9.0+ Query: scaffold_0 Database: /home/Analysis/Software/MetaCompare/BlastDB/CARD_PROT Fields: query acc., subject acc., alignment length, % query coverage per subject, evalue, score, subject title 1575 hits found scaffold_0 gb|CAC14595.1|ARO:3003056|smeE 1041 1 0.0 3541 gb|CAC14595.1|ARO:3003056|smeE [Stenotrophomonas maltophilia] scaffold_0 gb|CAC14595.1|ARO:3003056|smeE 1019 1 2.79e-116 1028 gb|CAC14595.1|ARO:3003056|smeE [Stenotrophomonas maltophilia] scaffold_0 gb|CAC14595.1|ARO:3003056|smeE 1053 1 2.96e-68 653 gb|CAC14595.1|ARO:3003056|smeE [Stenotrophomonas maltophilia] scaffold_0 gb|CAC14595.1|ARO:3003056|smeE 1026 1 9.71e-49 494 gb|CAC14595.1|ARO:3003056|smeE [Stenotrophomonas maltophilia] scaffold_0 gb|AAC76298.1|ARO:3000502|acrF 1018 1 0.0 3277 gb|AAC76298.1|ARO:3000502|acrF [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC76298.1|ARO:3000502|acrF 1031 1 1.18e-117 1038 gb|AAC76298.1|ARO:3000502|acrF [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC76298.1|ARO:3000502|acrF 1035 1 2.05e-66 638 gb|AAC76298.1|ARO:3000502|acrF [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC76298.1|ARO:3000502|acrF 1017 1 6.93e-56 552 gb|AAC76298.1|ARO:3000502|acrF [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC73564.1|ARO:3000216|acrB 1019 1 0.0 3263 gb|AAC73564.1|ARO:3000216|acrB [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC73564.1|ARO:3000216|acrB 1031 1 1.86e-115 1023 gb|AAC73564.1|ARO:3000216|acrB [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC73564.1|ARO:3000216|acrB 1055 1 2.07e-75 710 gb|AAC73564.1|ARO:3000216|acrB [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAC73564.1|ARO:3000216|acrB 1013 1 8.89e-52 519 gb|AAC73564.1|ARO:3000216|acrB [Escherichia coli str. K-12 substr. MG1655] scaffold_0 gb|AAA74437.1|ARO:3000378|MexB 1045 1 0.0 3192 gb|AAA74437.1|ARO:3000378|MexB [Pseudomonas aeruginosa] scaffold_0 gb|AAA74437.1|ARO:3000378|MexB 1033 1 4.23e-122 1073 gb|AAA74437.1|ARO:3000378|MexB [Pseudomonas aeruginosa] scaffold_0 gb|AAA74437.1|ARO:3000378|MexB 1011 1 3.26e-52 522 gb|AAA74437.1|ARO:3000378|MexB [Pseudomonas aeruginosa] scaffold_0 gb|AAA74437.1|ARO:3000378|MexB 641 1 1.06e-51 518 gb|AAA74437.1|ARO:3000378|MexB [Pseudomonas aeruginosa]

Thank you in advance for your kind help.

Best wishes

JL

minoh0201 commented 4 years ago

@jllavin77

  1. Yes, you are right. It is Blastx.
  2. CARD DB provided in MetaCompare would be an older version (v1.0.6) than one that you downloaded. When you use up-to-date DB, it is more likely to detect more sequences.
  3. Try to use "-outfmt 6" flag in Blast command. (e.g. "blastn -db [your db file] -query S1_genes.fa -out S1_CARD.txt -outfmt 6 -num_threads 64 -evalue 1e-10") *Note that input query should be a prodigal output.
jllavin77 commented 4 years ago

Dear @minoh0201, I was wrong as the input submitted to BLAST against the CARD database is a protein FASTA, so the BLAST should be BLASTp, as you said before. But now I have another question, why not using the UDBA output used to BLAST the other two databases, blasting CARD via BLASTX instead of having to generate the Prodigal protein FASTA and then BLASTp CARD db? I have tested this option myself, and the results are analogous to those of the BLASTp. This way, you may simplify the use of Metacompare by reducing the number of inputs required.

Another thing I'd like to point out is that the results of the tool are written only as STDOUT in the screen...would it be possible to add a piece of code so that the final risk calculation outputs as a separate text file? I highlight this because when you run several samples in a batch, if the program spits a warning, you completely lose track of the resistome risk scores for each sample...

minoh0201 commented 4 years ago

MetaCompare was designed to serve Metagenomic sequencing data which contains nucleotide sequences. IDBA-UD output also is supposed to be nucleotide sequences. Hence, Prodigal is introduced to predict coding-gene regions to do the subsequent BLASTX. I think your usage is a little bit off from the intention but it makes sense to skip Prodigal and use BLASTP if your FASTA file somehow contains amino acid sequences. My apology for not providing any further updates. You may folk the repo and customize as you wish.