qichao1984 / NCyc

45 stars 22 forks source link

Script usage example #1

Closed rprops closed 5 years ago

rprops commented 6 years ago

Hi, Thanks for this great work! I'm interested in using it. Could you please provide a short example on how to use the NCycProfiler.PL script for profiling metagenomic sequence data (e.g., fasta of gene sequences). Thanks, Ruben

qichao1984 commented 6 years ago

Thanks for your interest in this work, and sorry for the frustration.! I have now added more detaied explanation for using this script in the readme file. Please let me know if you have any question.

rprops commented 6 years ago

Much appreciated, thank you!

Could you clarify which is the preferred approach for profiling a metagenome? The paper suggests both raw reads and gene sequences (nucleotide or protein) can be used as input? Which would be most accurate? I assume the protein sequence input? In addition is the approach also applicable to be used to profile individual genomes (reconstructed from the metagenomes)?

Thanks again!

rprops commented 6 years ago

Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map (e.g., 320102293) ?

Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files?

Thanks again.

qichao1984 commented 5 years ago

Much appreciated, thank you!

Could you clarify which is the preferred approach for profiling a metagenome? The paper suggests both raw reads and gene sequences (nucleotide or protein) can be used as input? Which would be most accurate? I assume the protein sequence input? In addition is the approach also applicable to be used to profile individual genomes (reconstructed from the metagenomes)?

Thanks again!

For large datasets, I would suggest to use DIAMOND. For accuracy, I agree that using predicted protein sequences should be better. Theoretically, the database should also work for individual genomes as long as N cycle genes exist.

qichao1984 commented 5 years ago

Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map (e.g., 320102293) ?

Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files?

Thanks again.

There are two steps may affect output profiles. The first one is alignment hits with non-N-Cycle gene families. And you are right, I did not include them in the id2gene.map file, because they are treated as false positives. The second one is random subsampling process, which will not include all hits in the output profile. In addition, there might be other issues that may lead to empty profiles, such as inconsistent sample names and different data format. Such issues could be corrected by modifying the input data and/or the script by the user. Since different users have different working environment, not all cases can be considered in the script. Therefore, NCycProfiler.PL should be used as a reference script that may need revisions according to different working environment.

liangjinsong commented 5 years ago

Also, while I can run the script, the output profile file is always empty, even though the diamond search yielded ~300 alignment hits for each file. Could this because in the reference database NCyc_100.faa there are sequences without a corresponding annotation in the mapping file id2gene.map (e.g., 320102293) ? Could you explain whether there is any post-processing that removes alignment hits prior to making the functional profile file? If not, I guess I can make my own profile file using the diamond output files? Thanks again.

There are two steps may affect output profiles. The first one is alignment hits with non-N-Cycle gene families. And you are right, I did not include them in the id2gene.map file, because they are treated as false positives. The second one is random subsampling process, which will not include all hits in the output profile. In addition, there might be other issues that may lead to empty profiles, such as inconsistent sample names and different data format. Such issues could be corrected by modifying the input data and/or the script by the user. Since different users have different working environment, not all cases can be considered in the script. Therefore, NCycProfiler.PL should be used as a reference script that may need revisions according to different working environment.

I was also bothered by the empty profile output file. Could you make it easier to follow the script? I think the problem mentioned above also bothered other users. Thanks!

qichao1984 commented 5 years ago

Made a few revisions to track potential errors when generating NCyc gene profiles. This may provide some clues when the problem was caused by output files or inconsistent file names. However, I still suggest to use this script as a reference as one really cannot predict all user's behavior in the complex metagenomic analysis procedure in one single script.

rprops commented 5 years ago

Issue is solved after removing file extensions from the file with sample sizes (i.e. sample1 instead of sample1.faa). please consider mentioning this in your script example or in the comments. Thanks again!