Closed rprops closed 5 years ago
Thanks for your interest in this work, and sorry for the frustration.! I have now added more detaied explanation for using this script in the readme file. Please let me know if you have any question.
Much appreciated, thank you!
Could you clarify which is the preferred approach for profiling a metagenome? The paper suggests both raw reads and gene sequences (nucleotide or protein) can be used as input? Which would be most accurate? I assume the protein sequence input? In addition is the approach also applicable to be used to profile individual genomes (reconstructed from the metagenomes)?
Thanks again!
Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map
(e.g., 320102293
) ?
Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files?
Thanks again.
Much appreciated, thank you!
Could you clarify which is the preferred approach for profiling a metagenome? The paper suggests both raw reads and gene sequences (nucleotide or protein) can be used as input? Which would be most accurate? I assume the protein sequence input? In addition is the approach also applicable to be used to profile individual genomes (reconstructed from the metagenomes)?
Thanks again!
For large datasets, I would suggest to use DIAMOND. For accuracy, I agree that using predicted protein sequences should be better. Theoretically, the database should also work for individual genomes as long as N cycle genes exist.
Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map (e.g.,
320102293
) ?Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files?
Thanks again.
There are two steps may affect output profiles. The first one is alignment hits with non-N-Cycle gene families. And you are right, I did not include them in the id2gene.map file, because they are treated as false positives. The second one is random subsampling process, which will not include all hits in the output profile. In addition, there might be other issues that may lead to empty profiles, such as inconsistent sample names and different data format. Such issues could be corrected by modifying the input data and/or the script by the user. Since different users have different working environment, not all cases can be considered in the script. Therefore, NCycProfiler.PL should be used as a reference script that may need revisions according to different working environment.
Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map (e.g.,
320102293
) ? Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files? Thanks again.There are two steps may affect output profiles. The first one is alignment hits with non-N-Cycle gene families. And you are right, I did not include them in the id2gene.map file, because they are treated as false positives. The second one is random subsampling process, which will not include all hits in the output profile. In addition, there might be other issues that may lead to empty profiles, such as inconsistent sample names and different data format. Such issues could be corrected by modifying the input data and/or the script by the user. Since different users have different working environment, not all cases can be considered in the script. Therefore, NCycProfiler.PL should be used as a reference script that may need revisions according to different working environment.
I was also bothered by the empty profile output file. Could you make it easier to follow the script? I think the problem mentioned above also bothered other users. Thanks!
Made a few revisions to track potential errors when generating NCyc gene profiles. This may provide some clues when the problem was caused by output files or inconsistent file names. However, I still suggest to use this script as a reference as one really cannot predict all user's behavior in the complex metagenomic analysis procedure in one single script.
Issue is solved after removing file extensions from the file with sample sizes (i.e. sample1 instead of sample1.faa). please consider mentioning this in your script example or in the comments. Thanks again!
Hi, Thanks for this great work! I'm interested in using it. Could you please provide a short example on how to use the
NCycProfiler.PL
script for profiling metagenomic sequence data (e.g., fasta of gene sequences). Thanks, Ruben