ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Error with running --taxcheck-only parameter #226

Closed Dx-wmc closed 1 year ago

Dx-wmc commented 1 year ago

Describe the bug: After updating to the latest version, the same command I used before for species identification fails immediately with an error. The main annotation program, however, runs fine. The command is: pgap.py S5_generic.yaml --taxcheck-only -r -o S5 --ignore-all-errors

To Reproduce: The same genomes ran successfully with the 2022-04-14 version.

Log Files cwltool.log

azat-badretdin commented 1 year ago

Thank you for your report, Dave!

The first indication of the problem is this line in your log:

[2022-10-23 21:02:20] WARNING [step passdata] completed permanentFail

which indicates that some data was not installed. Please re-install it using the --taxcheck parameter, following https://github.com/ncbi/pgap/wiki/Quick-Start

azat-badretdin commented 1 year ago

We had a similar problem in https://github.com/ncbi/pgap/issues/183, and it has since been fixed. For now, I am not sure whether you are experiencing the same problem again.

Dx-wmc commented 1 year ago

Thanks for the advice. I re-downloaded the database and it worked, but I ran into some other problems:

  1. Setting the --cpus parameter seems to have no effect on the run time. Currently, it takes about 1.5 h to run one sample (6M) through the PGAP pipeline, whereas the 2022-04 version needed only 40 min.
  2. PGAP does not handle some small contigs (those too short to annotate a CDS on) as well as Prokka does. By default, PGAP treats such a small contig as part of an ORF, which I think is inaccurate. Could you provide a parameter to control this?
  3. PGAP's support for downstream software is incomplete. When I use the GFF files generated by PGAP as input to Roary, it always reports an error. Can this be improved?
  4. The output files are incomplete. For example, there are no ffn and faa files containing all CDS sequences (they can be generated by a script, but the headers will be missing). The currently provided faa file does not correspond to the CDS in the GFF file, and some sequences are missing, which has caused trouble when post-processing to study unknown protein sequences in the genome.

These are some of the problems I encountered while using PGAP. In addition, I strongly recommend that PGAP offer a quick annotation option (i.e., skip precise functional annotation and produce only rough CDS annotation), because PGAP's run time is a headache when annotating a large number of bacterial genomes in batch.

Yours, Dave

azat-badretdin commented 1 year ago

The currently provided faa file does not correspond to the CDS in the GFF file,

GFF output and CDS features in the flat files include pseudo features, for which no corresponding protein sequences are produced. This is by current design. Please let us know if you have examples of non-pseudo CDS in the flat file and/or GFF that are not in the FAA output.

thibaudnis commented 1 year ago

Currently, it takes about 1.5 h to run one sample (6M) through the PGAP pipeline, whereas the 2022-04 version needed only 40 min.

I am surprised that you see an increase in runtime (I assume using same CPU and mem?). We noticed no such deterioration in speed since April, for M. genitalium 37 (provided with PGAP as a test genome) and several other genomes that we regularly test our pipeline with.

By default, PGAP treats such a small contig as part of an ORF, which I think is inaccurate. Could you provide a parameter to control this?

Our first recommendation would be to improve the quality of the assembly. Second, can you please provide some examples of what you see versus what you’d like to see? Is the issue that you’d rather have partial ORFs on small contigs than no annotation? There is currently no parameter to control this.

PGAP's support for downstream software is incomplete. When I use the GFF files generated by PGAP as input to Roary, it always reports an error. Can this be improved?

Thanks for the request. We will consider it. Our output complies with the GFF3 standards. One of the requirements of Roary is that the genome sequence be appended to the bottom of the GFF, which is a bit unusual.

The output files are incomplete. For example, there are no ffn and faa files containing all CDS sequences (they can be generated by a script, but the headers will be missing).

Please see Azat's request for details above. As for the ffn, we should be able to do this.

I strongly recommend that PGAP offer a quick annotation option (i.e., skip precise functional annotation and produce only rough CDS annotation)

That's a good topic. We have discussed this internally. How would you define 'precise functional annotation'? Is no functional annotation okay? Is 20% more hypothetical proteins acceptable?

Thanks for all the feedback. This is very valuable and we'd love to talk. Please drop us a note at prokaryote-tools@ncbi.nlm.nih.gov if you are interested.

Dx-wmc commented 1 year ago

GFF output and CDS features in the flat files include pseudo features, for which no corresponding protein sequences are produced. This is by current design. Please let us know if you have examples of non-pseudo CDS in the flat file and/or GFF that are not in the FAA output.

Thank you for your answer. I had only viewed the GFF file with a one-line command (grep "CDS" *gff | less), so I did not notice the pseudo features. I have now reviewed the complete GFF file, and indeed the pseudogenes are the entries missing from the faa output. One thing I am still curious about is how PGAP defines a pseudogene and what the significance of its presence is.
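
A minimal Python sketch of the check described above: it separates CDS rows into pseudo and non-pseudo, which a plain grep for "CDS" does not do. It assumes that pseudo CDS rows carry a `pseudo=true` attribute in column 9 and that rows can be identified by `locus_tag`; both are assumptions about the GFF layout, and the function names are hypothetical, not part of PGAP.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def split_cds_by_pseudo(gff_lines):
    """Return (non_pseudo_tags, pseudo_tags) for all CDS rows.

    Assumption: a pseudo CDS is marked with pseudo=true in column 9.
    """
    non_pseudo, pseudo = set(), set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if any, ends the feature rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Fall back to the ID attribute when no locus_tag is present.
        tag = attrs.get("locus_tag", attrs.get("ID", ""))
        if attrs.get("pseudo") == "true":
            pseudo.add(tag)
        else:
            non_pseudo.add(tag)
    return non_pseudo, pseudo
```

Calling `split_cds_by_pseudo(open("annot.gff"))` (the file name is a placeholder) gives two sets of locus tags. Under the assumptions above, only the non-pseudo set would be expected to have counterparts in the faa file.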

Dx-wmc commented 1 year ago

I am surprised that you see an increase in runtime (I assume using same CPU and mem?). We noticed no such deterioration in speed since April, for M. genitalium 37 (provided with PGAP as a test genome) and several other genomes that we regularly test our pipeline with.

On the issue of run time, I retested on two servers using the same parameters and thread count (pgap.py -D /usr/local/bin/singularity S5_generic.yaml --ignore-all-errors -r --no-internet --no-self-update -o 1028 --cpus 40), and I raised ulimit -n above 8000. The run times are still clearly different: 42 min with the April 2022 version versus 59 min with the October 2022 version. I don't know what causes this. Likewise, even if I raise the thread count to 60 or 100, the run time is basically the same. Does PGAP have an optimal number of threads, beyond which adding more has no effect?

Our first recommendation would be to improve the quality of the assembly. Second, can you please provide some examples of what you see versus what you’d like to see? Is the issue that you’d rather have partial ORFs on small contigs than no annotation? There is currently no parameter to control this.

Small contigs are inevitable with second-generation sequencing. In some downstream analyses, however, counting these small contigs as part of a gene will, I feel, affect the results. So for pangenome analysis with Roary I will still choose Prokka or similar software, although it is a pity not to be able to make full use of PGAP.

Thanks for the request. We will consider it. Our output complies with the GFF3 standards. One of the requirements of Roary is that the genome sequence be appended to the bottom of the GFF, which is a bit unusual.

It is true that a GFF file containing sequences is unusual. But when I appended the sequence to the PGAP GFF file and ran Roary, it still reported an error, and I could not find the cause. So I have a suggestion: PGAP could keep the existing GFF file unchanged but add a gff-for-analysis file (in a format suitable for software such as Roary, with the pseudogenes removed).

That's a good topic. We have discussed this internally. How would you define 'precise functional annotation'? Is no functional annotation okay? Is 20% more hypothetical proteins acceptable?

Yes, rough annotation (that is, with CDS regions consistent with the final result but without full functional annotation) is acceptable. If a simple annotation mode were available, you could still run the complete pipeline for the genomes you are interested in, and meanwhile get other work done instead of just waiting for the output.

thibaudnis commented 1 year ago

PGAP could keep the existing GFF file unchanged but add a gff-for-analysis file (in a format suitable for software such as Roary, with the pseudogenes removed)

I understand that the desired format should be something like:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region  1 1667867
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=210
seq1     Local   region  1       1667867 .       +       .       ID=seq1:1..1667867;Dbxref=taxon:210;Is_circular=true;gbkey=Src;mol_type=genomic DNA
seq1     .       gene    217     633     .       -       .       ID=gene-pgaptmp_000001;Name=nusB;gbkey=Gene;gene=nusB;gene_biotype=protein_coding;locus_tag=pgaptmp_000001
seq1     Protein Homology        CDS     217     633     .       -       0       ID=cds-pgaptmp_000001;Parent=gene-pgaptmp_000001;Name=extdb:pgaptmp_000001;gbkey=CDS;gene=nusB;inference=COORDINATES: similar to AA sequence:RefSeq:NP_206803.1;locus_tag=pgaptmp_000001;product=transcription antitermination factor NusB;protein_id=extdb:pgaptmp_000001;transl_table=11
seq1     .       gene    635     1105    .       -       .       ID=gene-pgaptmp_000002;Name=pgaptmp_000002;gbkey=Gene;gene_biotype=protein_coding;locus_tag=pgaptmp_000002
seq1     Protein Homology        CDS     635     1105    .       -       0       ID=cds-pgaptmp_000002;Parent=gene-pgaptmp_000002;Name=extdb:pgaptmp_000002;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:NP_206804.1;locus_tag=pgaptmp_000002;product=6%2C7-dimethyl-8-ribityllumazine synthase;protein_id=extdb:pgaptmp_000002;transl_table=11
[...]
##FASTA
>seq1
TGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGAT
TAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGT
GATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATT
[...]

We are working on producing such a file. Thanks for the suggestion.
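
Until such a file is available, the layout above can be approximated with a small script. The following is only a sketch under stated assumptions: it drops every feature row whose attribute column contains `pseudo=true` (an assumption about how pseudo features are marked), then appends `##FASTA` and the genome sequence. The function name is hypothetical, and the result has not been verified against Roary.

```python
def to_roary_gff(gff_lines, fasta_lines):
    """Build a GFF with an embedded FASTA section and no pseudo features.

    gff_lines:   lines of the annotation GFF
    fasta_lines: lines of the genome FASTA; the sequence IDs must match
                 column 1 of the GFF
    Assumption: pseudo features are marked with pseudo=true in column 9.
    """
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # discard any FASTA section already present
        if not line:
            continue  # skip blank lines
        if not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 9 and "pseudo=true" in cols[8].split(";"):
                continue  # drop pseudo genes and pseudo CDS rows
        out.append(line)
    out.append("##FASTA")
    out.extend(seq.rstrip("\n") for seq in fasta_lines)
    return out
```

Writing `"\n".join(to_roary_gff(open("annot.gff"), open("genome.fasta")))` to a new file (both file names are placeholders) produces the combined file. Note that a non-pseudo child row whose parent gene was dropped would be left without a parent, so the output is worth inspecting before use.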

azat-badretdin commented 1 year ago

Dave, can you test whether Roary is able to process this file: test.roary.input.gff.txt? Please remove .txt before testing.

Dx-wmc commented 1 year ago

Dave, can you test whether Roary is able to process this file: test.roary.input.gff.txt? Please remove .txt before testing.

I ran Roary on the GFF file you provided together with another GFF file produced by Prokka. This time the output is correct, but there is one drawback: Roary prints a lot of Perl warnings:

Use of uninitialized value within @cells in join or string at /cluster/home/zhangying/software/miniconda3/envs/bact_pangenome/lib/site_perl/5.26.2/Bio/Roary/ReformatInputGFFs.pm line 152, <$input_gff_fh> line 79869.
Use of uninitialized value $cells[8] in split at /cluster/home/software/miniconda3/envs/bact_pangenome/lib/site_perl/5.26.2/Bio/Roary/ReformatInputGFFs.pm line 135, <$input_gff_fh> line 79870.
Use of uninitialized value within @cells in join or string at /cluster/home/software/miniconda3/envs/bact_pangenome/lib/site_perl/5.26.2/Bio/Roary/ReformatInputGFFs.pm line 152, <$input_gff_fh> line 79870.
Use of uninitialized value within @cells in join or string at /cluster/home/software/miniconda3/envs/bact_pangenome/lib/site_perl/5.26.2/Bio/Roary/ReformatInputGFFs.pm line 152, <$input_gff_fh> line 79870

I am not good at Perl, so I do not know whether you can make sense of these messages. I also looked at the file you provided and found one shortcoming: pseudogenes and ribosomal features still remain among the CDS entries, and I think they should be removed.
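
One possible reading of the warnings, offered only as a guess from the message text and not from Roary's source: `$cells[8]` would be the ninth tab-separated column (the attributes), so the reported input lines may have fewer than nine columns or be blank. A minimal Python sketch to locate such lines in a GFF file follows; the function name is hypothetical.

```python
def find_short_rows(gff_lines):
    """Return (line_number, column_count) for rows before ##FASTA that are
    blank or have fewer than 9 tab-separated columns.

    Comment lines (starting with '#') are ignored. Blank lines are
    reported with a column count of 0.
    """
    problems = []
    for number, line in enumerate(gff_lines, start=1):
        text = line.rstrip("\n")
        if text.startswith("##FASTA"):
            break  # sequence lines are not feature rows
        if text.startswith("#"):
            continue
        if not text.strip():
            problems.append((number, 0))
            continue
        count = len(text.split("\t"))
        if count < 9:
            problems.append((number, count))
    return problems
```

Running `find_short_rows(open("test.roary.input.gff"))` and comparing the reported line numbers with those in the warnings (79869 and 79870) would show whether this guess holds for the file in question.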