marcelauliano / MitoHiFi

Find, circularise and annotate mitogenome from PacBio assemblies
MIT License

Preparing small example files to run it for a short time. #30

Open junaruga opened 2 years ago

junaruga commented 2 years ago

This issue follows up on https://github.com/marcelauliano/MitoHiFi/issues/26#issuecomment-1282343446. It is a proposal for an enhancement.

I ran mitohifi.py -r <reads file> ... with this repository's example files, and it took 4500.39 seconds (about 75 minutes). It would be great if this repository had small example data files so that the command finishes in a short time.

$ docker run --rm -w /data/ -v /home/jaruga/tmp/mitohifi/exampleFiles/:/data/ -t docker.io/biocontainers/mitohifi:2.2_cv1 mitohifi.py -r /data/ilDeiPorc1.reads.fa -f /data/MW539688.1.fasta -g /data/MW539688.1.gb -t 4 -o 2
...
2022-10-18 12:35:21 [INFO] Pipeline finished!
2022-10-18 12:35:21 [INFO] Run time: 4500.39 seconds

junaruga commented 2 years ago

According to the full log of running mitohifi.py with the example files, many contigs are processed. Perhaps we can reduce the number of contigs in the file to reduce the total running time? https://gist.github.com/junaruga/b7ebdc41df63a3b041c5ae53797a1a29#file-mitohifi_with_example_data_reads_file-log-L42-L422
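To put a number on "many contigs", the log itself can be scanned. The following is a minimal sketch, assuming each contig handled by the pipeline produces one "[INFO] Working with contig <name>" line, as seen in the MitoHiFi 2.2 logs in this thread. The function name contigs_in_log is hypothetical and not part of MitoHiFi.

```python
import re

# Per-contig line printed by the pipeline, for example:
# "2022-10-25 16:34:06 [INFO] Working with contig ptg000001l"
# Assumption: one such line is written for every contig that is processed.
CONTIG_LINE = re.compile(r"\[INFO\] Working with contig (\S+)")


def contigs_in_log(lines):
    """Return the contig names found in the log lines, in order of appearance."""
    names = []
    for line in lines:
        match = CONTIG_LINE.search(line)
        if match:
            names.append(match.group(1))
    return names


sample = [
    "2022-10-25 16:34:06 [INFO] Working with contig ptg000001l",
    "2022-10-25 16:34:06 [INFO] Started ptg000001l circularization",
]
print(len(contigs_in_log(sample)), "contig(s) processed")
```

For a saved log, passing an open file handle instead of the sample list works the same way, because the function only iterates over lines.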

junaruga commented 2 years ago

Looking at the log above for the example files, I think the performance tuning point is step 6, the step that runs until the step 7 message appears. https://gist.github.com/junaruga/b7ebdc41df63a3b041c5ae53797a1a29#file-mitohifi_with_example_data_reads_file-log-L42-L1561

That step took about 67 of the 75 minutes of total running time.

2022-10-18 11:22:29 [INFO] 6. Now we are going to circularize, annotate and rotate each filtered contig. Those are potential mitogenome(s).
...
Gene CYTB contains frameshift
2022-10-18 12:30:20 [INFO] 7. Now the rotated contigs will be aligned
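The elapsed time between the two numbered log lines quoted above can be checked from their timestamps. This is a minimal sketch, assuming every log line starts with a 19-character "YYYY-MM-DD HH:MM:SS" timestamp as in the excerpts here. The helper name seconds_between is hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout of the leading timestamp


def seconds_between(first_line, second_line):
    """Seconds elapsed between the timestamps that start two log lines."""
    start = datetime.strptime(first_line[:19], TS_FORMAT)
    end = datetime.strptime(second_line[:19], TS_FORMAT)
    return (end - start).total_seconds()


# The two log lines quoted in this comment.
step6 = (
    "2022-10-18 11:22:29 [INFO] 6. Now we are going to circularize, "
    "annotate and rotate each filtered contig. Those are potential mitogenome(s)."
)
step7 = "2022-10-18 12:30:20 [INFO] 7. Now the rotated contigs will be aligned"

elapsed = seconds_between(step6, step7)
minutes, seconds = divmod(int(elapsed), 60)
print(f"{minutes} min {seconds} s")
```

The same helper can be applied to any pair of lines in the full log to see where the remaining time goes.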
junaruga commented 2 years ago

I created a small reads FASTA file, ilDeiPorc1.reads.small.fa, that is just the first 20045 lines of ilDeiPorc1.reads.fa. The data is about 10% of the size of the current example file. It produces only 1 contig, and the run took 5 minutes 31 seconds. The full log is here.

$ wc -l ilDeiPorc1.reads.*
  244682 ilDeiPorc1.reads.fa
   20045 ilDeiPorc1.reads.small.fa
  264727 total
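The small file above was made by taking the first 20045 lines. As an alternative that is not what was used in this thread, the following minimal sketch cuts on FASTA record boundaries, so the last read is never truncated. The function names and the max_records parameter are hypothetical, and the sketch assumes a plain-text FASTA file in which each record starts with a ">" header line.

```python
def first_fasta_records(lines, max_records):
    """Yield the lines of the first `max_records` FASTA records, keeping each record whole."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > max_records:
                break  # stop at the header of the first unwanted record
        if seen > 0:  # skip anything before the first header
            yield line


def write_subset(src_path, dst_path, max_records):
    """Copy the first `max_records` records from src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(first_fasta_records(src, max_records))


demo = [">read1\n", "ACGT\n", "TTGA\n", ">read2\n", "GGCC\n", ">read3\n", "AATT\n"]
subset = list(first_fasta_records(demo, 2))
print("".join(subset), end="")
```

Calling write_subset("ilDeiPorc1.reads.fa", "ilDeiPorc1.reads.small.fa", n) with a chosen n would produce a comparable small file. The value of n that still yields a mitogenome contig would have to be found by trial.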

$ time docker run --rm -w /data/ -v /home/jaruga/tmp/mitohifi/exampleFiles/:/data/ -t docker.io/biocontainers/mitohifi:2.2_cv1 mitohifi.py -r /data/ilDeiPorc1.reads.small.fa -f /data/MW539688.1.fasta -g /data/MW539688.1.gb -t 4 -o 2
2022-10-25 16:33:56 [INFO] Welcome to MitoHifi v2. Starting pipeline...
2022-10-25 16:33:56 [INFO] Length of related mitogenome is: 15354 bp
2022-10-25 16:33:56 [INFO] Number of genes on related mitogenome: 37
...
2022-10-25 16:34:06 [INFO] 6. Now we are going to circularize, annotate and rotate each filtered contig. Those are potential mitogenome(s).
2022-10-25 16:34:06 [INFO] Working with contig ptg000001l
2022-10-25 16:34:06 [INFO] Started ptg000001l circularization
2022-10-25 16:34:07 [INFO] ptg000001l circularization done. Circularization info saved on ./potential_contigs/ptg000001l/ptg000001l.circularisationCheck.txt
2022-10-25 16:34:07 [INFO] Started ptg000001l (MitoFinder) annotation
2022-10-25 16:36:52 [INFO] ptg000001l annotation done. Annotation log saved on ./potential_contigs/ptg000001l/ptg000001l.annotation_MitoFinder.log
2022-10-25 16:36:52 [INFO] Started ptg000001l rotation.
2022-10-25 16:36:52 [INFO] Rotation of ptg000001l done. Rotated is at ptg000001l.mitogenome.rotated.fa
...
2022-10-25 16:39:24 [INFO] Pipeline finished!
2022-10-25 16:39:24 [INFO] Run time: 328.83 seconds

real    5m31.986s
user    0m0.048s
sys     0m0.032s

junaruga commented 2 years ago

I also tested the contigs file case (the -c option instead of -r). The running time was short as well: 338.59 seconds. The full log is here.

$ time docker run --rm -w /data/ -v /home/jaruga/tmp/mitohifi/exampleFiles/:/data/ -t docker.io/biocontainers/mitohifi:2.2_cv1 mitohifi.py -c /data/test.fa -f /data/NC_016067.1.fasta -g /data/NC_016067.1.gb -t 4 -o 2
...
2022-10-25 20:06:23 [INFO] Pipeline finished!
2022-10-25 20:06:23 [INFO] Run time: 338.59 seconds

real    5m41.835s
user    0m0.044s
sys     0m0.032s