jeffersonfparil / compare_genomes

A comparative genomics workflow using Nextflow, conda, Julia and R
GNU General Public License v3.0

Error executing process > 'PLOT' #10

Open StepanSaenko opened 1 year ago

StepanSaenko commented 1 year ago

Hello! I tried to run compare_genomes on several species, got an error

Error in file(file, "r") : cannot open the connection
Calls: read.nexus -> scan -> file
In addition: Warning message:
In file(file, "r") : cannot open file 'ORTHOGROUPS_SINGLE_GENE.NT.timetree.nex': No such file or directory
Execution halted

Also, most of the output files (e.g. expanded_orthogroup) are empty. Is there something wrong with my files?

P.S. Panther17 is no longer available, will it be better to change from v17 to v18 everywhere?

Thank you.

jeffersonfparil commented 1 year ago

Hi Stepan, Have you tried this? Also, thanks for letting me know that. I'll look into updating Panther version 17 to 18.

StepanSaenko commented 1 year ago

Thank you for your reply. I reduced the number of species and dates, but the error still appears. I got 3 species now:

Bombyx_mori,Drosophila_melanogaster -286.000                                                                                                                                                             
Drosophila_melanogaster,Tribolium_castaneum -333.000                                                                                                                                                     
Tribolium_castaneum,Bombyx_mori -333.000  

Also, there is one tiny problem: even if the Panther database archive has already been downloaded, [14/545db0] process > DOWNLOAD_PANTHER_DATABASE [ 0%] 0 of 1 still runs. Ctrl+C aborts this step, but it tries to download again on every run.

StepanSaenko commented 1 year ago

On my data, all the steps seem to be suspiciously fast, which makes me suspect that the problem is my data. I'm sorry for disturbing you.

jeffersonfparil commented 1 year ago

Hi Stepan, no worries.

Please, don't hesitate to ask if you get stuck again.

StepanSaenko commented 1 year ago

Could you please try the workflow using my data?

urls.txt

Tenebrio_molitor.fna,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/166/875/GCA_907166875.3_Tenebrio_molitor_v3/GCA_907166875.3_Tenebrio_molitor_v3_genomic.fna.gz
Tenebrio_molitor.gff,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/166/875/GCA_907166875.3_Tenebrio_molitor_v3/GCA_907166875.3_Tenebrio_molitor_v3_genomic.gff.gz
Tenebrio_molitor.faa,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/166/875/GCA_907166875.3_Tenebrio_molitor_v3/GCA_907166875.3_Tenebrio_molitor_v3_protein.faa.gz
Tenebrio_molitor.cds,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/166/875/GCA_907166875.3_Tenebrio_molitor_v3/GCA_907166875.3_Tenebrio_molitor_v3_cds_from_genomic.fna.gz
Tribolium_castaneum.fna,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/335/GCF_000002335.3_Tcas5.2/GCF_000002335.3_Tcas5.2_genomic.fna.gz
Tribolium_castaneum.gff,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/335/GCF_000002335.3_Tcas5.2/GCF_000002335.3_Tcas5.2_genomic.gff.gz
Tribolium_castaneum.faa,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/335/GCF_000002335.3_Tcas5.2/GCF_000002335.3_Tcas5.2_protein.faa.gz
Tribolium_castaneum.cds,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/335/GCF_000002335.3_Tcas5.2/GCF_000002335.3_Tcas5.2_cds_from_genomic.fna.gz
Tribolium_madens.fna,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/345/945/GCF_015345945.1_Tmad_KSU_1.1/GCF_015345945.1_Tmad_KSU_1.1_genomic.fna.gz
Tribolium_madens.gff,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/345/945/GCF_015345945.1_Tmad_KSU_1.1/GCF_015345945.1_Tmad_KSU_1.1_genomic.gff.gz
Tribolium_madens.faa,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/345/945/GCF_015345945.1_Tmad_KSU_1.1/GCF_015345945.1_Tmad_KSU_1.1_protein.faa.gz
Tribolium_madens.cds,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/345/945/GCF_015345945.1_Tmad_KSU_1.1/GCF_015345945.1_Tmad_KSU_1.1_cds_from_genomic.fna.gz
Zophobas_morio.fna,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/027/724/725/GCA_027724725.1_ASM2772472v1/GCA_027724725.1_ASM2772472v1_genomic.fna.gz
Zophobas_morio.gff,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/027/724/725/GCA_027724725.1_ASM2772472v1/GCA_027724725.1_ASM2772472v1_genomic.gff.gz
Zophobas_morio.faa,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/027/724/725/GCA_027724725.1_ASM2772472v1/GCA_027724725.1_ASM2772472v1_protein.faa.gz
Zophobas_morio.cds,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/027/724/725/GCA_027724725.1_ASM2772472v1/GCA_027724725.1_ASM2772472v1_cds_from_genomic.fna.gz
Dendroctonus_ponderosae.fna,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/466/585/GCF_020466585.1_Dpon_F_20191213v2/GCF_020466585.1_Dpon_F_20191213v2_genomic.fna.gz
Dendroctonus_ponderosae.gff,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/466/585/GCF_020466585.1_Dpon_F_20191213v2/GCF_020466585.1_Dpon_F_20191213v2_genomic.gff.gz
Dendroctonus_ponderosae.faa,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/466/585/GCF_020466585.1_Dpon_F_20191213v2/GCF_020466585.1_Dpon_F_20191213v2_protein.faa.gz
Dendroctonus_ponderosae.cds,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/466/585/GCF_020466585.1_Dpon_F_20191213v2/GCF_020466585.1_Dpon_F_20191213v2_cds_from_genomic.fna.gz

dates.txt

Tribolium_castaneum,Tenebrio_molitor -172.000
Dendroctonus_ponderosae,Tribolium_castaneum -215.000

venn_species_max_5.txt

Tribolium castaneum
Tenebrio molitor
Dendroctonus ponderosae
Tribolium madens
Zophobas morio

I changed the dataset and still got the error

jeffersonfparil commented 1 year ago

Hi Stepan, Is it the same exact error message? Can you also post your params.config? I'll run it as soon as I'm free.

StepanSaenko commented 1 year ago

Of course:

params {
    dir = '/home/saenkos/comparing3'
    species_of_interest = 'Tribolium_castaneum'
    species_of_interest_panther_HMM_for_gene_names_url = 'http://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR18.0_tribolium'
    panther_hmm_database_location = '/home/saenkos/comparing3/PantherHMM_18.0'
    urls = "${projectDir}/urls.txt"
    dates = "${projectDir}/dates.txt"
    comparisons_4DTv = "${projectDir}/comparisons_4DTv.txt"
    venn_species_max_5 = "${projectDir}/venn_species_max_5.txt"
    genes = "${projectDir}/genes.txt"
    cafe5_n_gamma_cats = 1 // If 1 then use the base model; else use the gamma model with <cafe5_n_gamma_cats> gamma categories to test
    cafe5_pvalue = 0.01
    go_term_enrich_genome_id = 7070 // go_term_enrich_annotation_id = "GO:0008150"
    go_term_enrich_test = "FISHER"
    go_term_enrich_correction = "FDR"
    go_term_enrich_ngenes_per_test = 100
    go_term_enrich_ntests = 5
}
includeConfig 'process.config'

The same error cannot open file 'ORTHOGROUPS_SINGLE_GENE.NT.timetree.nex': No such file or directory

jeffersonfparil commented 1 year ago

Can you please try the version 17.0 Panther databases first, i.e. use the following in params.config:

species_of_interest_panther_HMM_for_gene_names_url = 'http://data.pantherdb.org/ftp/sequence_classifications/17.0/PANTHER_Sequence_Classification_files/PTHR17.0_arabidopsis'

panther_hmm_database_location = 'http://data.pantherdb.org/ftp/panther_library/17.0/PANTHER17.0_hmmscoring.tgz'

StepanSaenko commented 1 year ago

Changed, the same result. But for the test data it works using v18. Could the distances between species be too large? Or maybe it happened because I've left genes.txt empty?

jeffersonfparil commented 1 year ago

Hi Stepan, Thanks for trying both versions. Have you looked at the detailed error messages in compare_genomes/work/<most_recent_folders>/<hash_signature>/? The files are hidden, i.e. .command.sh, .command.log and .command.err.
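
A minimal sketch of that check, assuming the usual Nextflow layout in which each task runs in its own work/<prefix>/<hash>/ folder and leaves the hidden .command.* files there. The helpers `latest_task_dirs` and `tail` are hypothetical names for illustration and are not part of compare_genomes or Nextflow.

```python
from pathlib import Path


def latest_task_dirs(work_dir, n=5):
    """Return the n most recently modified task directories under work_dir."""
    # Assumption: task directories sit two levels down, work/<prefix>/<hash>/
    task_dirs = [d for d in Path(work_dir).glob("*/*") if d.is_dir()]
    return sorted(task_dirs, key=lambda d: d.stat().st_mtime, reverse=True)[:n]


def tail(path, n_lines=20):
    """Return the last n_lines lines of a text file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-n_lines:]


def report(work_dir, n=5):
    """Collect the tail of .command.err for the most recent task directories."""
    lines = []
    for task_dir in latest_task_dirs(work_dir, n):
        lines.append(f"== {task_dir}")
        lines.extend(tail(task_dir / ".command.err"))
    return lines
```

Calling `print("\n".join(report("compare_genomes/work")))` from the directory that contains the checkout would then show the last lines of the most recent error files; the path is an assumed location and may differ on your system.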

Empty genes.txt should not affect the time tree. The divergence time might be an issue if it clashes with what the sequence differences show.

StepanSaenko commented 1 year ago

I tried to look through the last .err and .log, there are two odd errors:

1) cut: /home/saenkos/compare_genomes/modules/genes.txt: No such file or directory
I don't know why there is a /modules/ directory in the path.
2) conda 23.7.3 requires requests<3,>=2.27.0, but you have requests 2.22.0 which is incompatible.
This is also strange, because the test files work.

All the other errors seem to be in a cycle

Did you try to run the workflow on my files?

jeffersonfparil commented 1 year ago

Oh, I just noticed that the paths in your params.config may not be pointing to the correct locations. Are your input lists (urls, dates, etc.) inside compare_genomes/config/? If so, then the locations in params.config should be like urls = "${projectDir}/../config/urls.txt" and not just urls = "${projectDir}/urls.txt".
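
To illustrate why the relative path matters, here is a small sketch. It assumes that ${projectDir} resolves to the modules/ folder of the checkout, which is what the "cut: /home/saenkos/compare_genomes/modules/genes.txt" message above suggests; `resolve` is a hypothetical helper, not something Nextflow provides.

```python
import posixpath

# Assumed value of ${projectDir}, taken from the path in the `cut` error above.
PROJECT_DIR = "/home/saenkos/compare_genomes/modules"


def resolve(template, project_dir=PROJECT_DIR):
    """Expand ${projectDir} in a config value and normalise the resulting path."""
    return posixpath.normpath(template.replace("${projectDir}", project_dir))


print(resolve("${projectDir}/urls.txt"))
# /home/saenkos/compare_genomes/modules/urls.txt
print(resolve("${projectDir}/../config/urls.txt"))
# /home/saenkos/compare_genomes/config/urls.txt
```

Under that assumption, only the second form points into config/.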

Also, all my VMs are currently busy with other analyses and jobs running for the next few weeks.

StepanSaenko commented 1 year ago

Well, I reinstalled the whole workflow and it seems to be working. I'm afraid I wasted your time. Anyway, thank you for your responses and time. I will let you know how it goes, if you are interested.

jeffersonfparil commented 1 year ago

Yes, please let me know if the issue persists; otherwise, I'll close this issue.

No worries, I set the wrong directories all the time, and if everything else fails, the good ol' turning it off and on again can help.

StepanSaenko commented 1 year ago

I am really sorry, but after 14 hours the process stopped and the error is here again.

I checked .command.err files which were generated only for the last run in the just-created copy of the workflow:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at align.ProfileAligner.adjustScoreMatrix(ProfileAligner.java:711)
        at align.ProfileAligner.alignProfiles(ProfileAligner.java:183)
        at align.CodingMSA.buildAlignement(CodingMSA.java:615)
        at align.CodingMSA.buildProfile(CodingMSA.java:510)
        at align.CodingMSA.buildAlignmentReliable(CodingMSA.java:650)
        at align.CodingMSA.run(CodingMSA.java:659)
        at utils.MacseMain.main(MacseMain.java:426)
fasta file:OG0006401.aligned.unsorted.cds.tmp not found
java.io.FileNotFoundException: OG0006401.aligned.unsorted.cds.tmp (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:111)
        at java.base/java.io.FileReader.<init>(FileReader.java:60)
        at bioObject.CodingDnaSeq.readFasta(CodingDnaSeq.java:562)
        at utils.MacseMain.main(MacseMain.java:590)
rm: cannot remove 'OG0006401*.tmp': No such file or directory
rm: cannot remove 'OG0006401.AA.prot': No such file or directory
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at align.ProfileAligner.<init>(ProfileAligner.java:120)
        at align.CodingMSA.<init>(CodingMSA.java:64)
        at align.CodingMSA.buildAlignmentReliable(CodingMSA.java:633)
        at align.CodingMSA.run(CodingMSA.java:659)
        at utils.MacseMain.main(MacseMain.java:426)
fasta file:OG0008703.aligned.unsorted.cds.tmp not found
java.io.FileNotFoundException: OG0008703.aligned.unsorted.cds.tmp (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:111)
        at java.base/java.io.FileReader.<init>(FileReader.java:60)
        at bioObject.CodingDnaSeq.readFasta(CodingDnaSeq.java:562)
        at utils.MacseMain.main(MacseMain.java:590)
rm: cannot remove 'OG0008703*.tmp': No such file or directory
rm: cannot remove 'OG0008703.AA.prot': No such file or directory

then ValueError: invalid mode: 'rU'. I have already met this error; it is caused by Python 3.11, which no longer accepts the 'U' flag in the file mode.

The solution is to keep using Python 3.11 and remove the "U" from the mode argument of the function in input.py.
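
For reference, a minimal sketch of the kind of change meant here. Plain 'r' already gives universal-newline handling in text mode, so dropping the "U" should not change what is read. The function `read_lines` is a hypothetical stand-in; the actual call in input.py is not shown in this thread.

```python
def read_lines(path):
    """Read a text file line by line without the removed 'U' mode flag."""
    # Before (fails on Python 3.11+ with "ValueError: invalid mode: 'rU'"):
    #     handle = open(path, "rU")
    # After: default text mode already translates \r\n and \r to \n.
    with open(path, "r") as handle:
        return [line.rstrip("\n") for line in handle]
```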

the next one

ERROR: LoadError: SystemError: opening file "CDS/OG0000762.cds": No such file or directory
Stacktrace:
 [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
   @ Base ./error.jl:176
 [2] #systemerror#80
   @ ./error.jl:175 [inlined]
 [3] systemerror
   @ ./error.jl:175 [inlined]
 [4] open(fname::String; lock::Bool, read::Bool, write::Nothing, create::Nothing, truncate::Nothing, append::Nothing)
   @ Base ./iostream.jl:293
 [5] open(fname::String, mode::String; lock::Bool)
   @ Base ./iostream.jl:356
 [6] open(fname::String, mode::String)
   @ Base ./iostream.jl:355
 [7] top-level scope
   @ ~/compare_my/compare_genomes/scripts/extract_sequence_using_name_query.jl:57
in expression starting at /home/saenkos/compare_my/compare_genomes/scripts/extract_sequence_using_name_query.jl:57
ls: cannot access 'OG0000762-*.fasta': No such file or directory
fasta file:OG0000762.fasta not found
java.io.FileNotFoundException: OG0000762.fasta (No such file or directory)

then, but I am not sure this is really an error

signal (15): Terminated
in expression starting at /home/saenkos/compare_my/compare_genomes/config/install_julia_packages.jl:1
ijl_uncompress_ir at /usr/local/src/conda/julia-1.8.3/src/ircode.c:862
InliningTodo at ./compiler/ssair/inlining.jl:870 [inlined]
resolve_todo at ./compiler/ssair/inlining.jl:804
analyze_method! at ./compiler/ssair/inlining.jl:861
handle_match! at ./compiler/ssair/inlining.jl:1293
analyze_single_call! at ./compiler/ssair/inlining.jl:1210
assemble_inline_todo! at ./compiler/ssair/inlining.jl:1425
ssa_inlining_pass! at ./compiler/ssair/inlining.jl:82
jfptr_ssa_inlining_passNOT._16094.clone_1 at /home/saenkos/anaconda3/envs/myenv/envs/compare_genomes/lib/julia/sys.so (unknown line)

Also, I updated the link to the V17 classification files in modules/setup.nf, because V17 is not the current release anymore:

http://data.pantherdb.org/ftp/sequence_classifications/17.0/PANTHER_Sequence_Classification_files/

So, I'm going to fix some of them and run the workflow one more time increasing CPUs and memory limits. Now it seems to be better than before.

jeffersonfparil commented 1 year ago

Sounds like a good plan. Yes, it seems you're just running out of memory, and the subsequent error messages are just because the previous step did not generate the CDS alignments it was expecting.

StepanSaenko commented 1 year ago

So, I increased the memory size to 160GB and 64 CPUs.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at align.ProfileAligner.adjustScoreMatrix(ProfileAligner.java:711)
        at align.ProfileAligner.alignProfiles(ProfileAligner.java:183)
        at align.CodingMSA.buildAlignement(CodingMSA.java:615)
        at align.CodingMSA.buildProfile(CodingMSA.java:510)
        at align.CodingMSA.buildAlignmentReliable(CodingMSA.java:650)
        at align.CodingMSA.run(CodingMSA.java:659)
        at utils.MacseMain.main(MacseMain.java:426)

How is it possible? I have only 5 genomes ~160M in length each.

jeffersonfparil commented 1 year ago

Did you increase the memory in process.config accordingly? If you did, then the memory may be getting stretched thin across the cpus. Try reducing the number of cpus in process.config to give each core more memory.
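
As a rough illustration of that point, assuming the memory limit is shared evenly among the CPUs (an assumption about how the jobs are scheduled, not something confirmed in this thread), the per-CPU share works out as follows.

```python
def memory_per_cpu_gb(total_memory_gb, n_cpus):
    """Even share of the memory limit per CPU, in GB (simplifying assumption)."""
    return total_memory_gb / n_cpus


# 160 GB and 64 CPUs are the values reported above;
# 24 is just an example of a lower CPU count.
for cpus in (64, 24):
    print(f"{cpus} CPUs -> {memory_per_cpu_gb(160, cpus):.1f} GB per CPU")
# 64 CPUs -> 2.5 GB per CPU
# 24 CPUs -> 6.7 GB per CPU
```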

StepanSaenko commented 1 year ago

I reduced the number of cpus (24), and now diamond blastp has been running for 50 hours. Have you ever seen such a thing?

jeffersonfparil commented 1 year ago

Yes, some of the plant genomes I've dealt with took more than a week to finish the whole workflow. Are you at assess_specific_genes.nf?

StepanSaenko commented 1 year ago

I'm on

executor >  local (2)
[09/f8b172] process > FIND_ORTHOGROUPS               [100%] 1 of 1 ✔
[b5/27007a] process > ASSIGN_GENE_FAMILIES_TO_ORT... [  0%] 0 of 1
[-        ] process > ASSESS_ORTHOGROUPS_DISTRIBU... -

File orthogroups.faa was created 8 hours ago. But, unfortunately, I should restart the workflow because our Slurm management system on the HPC provides only 72 hours.

jeffersonfparil commented 1 year ago

You can run each module separately, and you may even go into each module and extract portions of the shell scripts so you can run them individually. That should give you even finer control over the whole workflow and should allow you to work around the 72-hour maximum run time on your HPC.

StepanSaenko commented 1 year ago

Well, I am still trying to move on. Earlier I got this error:

ERROR: LoadError: SystemError: opening file "CDS/putative.cds": No such file or directory

Have you got such an error from iqtree2 ?

IQ-TREE multicore version 2.2.0.3 COVID-edition for Linux 64-bit built Aug  2 2022
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.

Host:    node362 (AVX512, FMA3, 187 GB RAM)
Command: iqtree2 -s ORTHOGROUPS_SINGLE_GENE.NT.aln -p alignment_parition.NT.nex -T 20 --date /home/saenkos/compare_my/compare_genomes/modules/../config/dates.txt --date-tip 0 --prefix ORTHOGROUPS_SINGLE_GENE.NT --redo
Seed:    886971 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Wed Nov  1 08:23:09 2023
Kernel:  AVX+FMA - 20 threads (32 CPU cores detected)

Reading partition model file alignment_parition.NT.nex ...
Reading alignment file ORTHOGROUPS_SINGLE_GENE.NT.aln ... Fasta format detected
Reading fasta file: done in 1.75486 secs using 91.45% CPU
ERROR: Sequence Tribolium_castaneum contains too many characters (16349019)
ERROR: Sequence Tribolium_madens contains too many characters (108930093)
ERROR: 

jeffersonfparil commented 1 year ago

For some reason the alignments do not seem to have the same lengths. Can you look at the alignment file? Maybe they have been concatenated twice during the course of the retries. That may be something we can fix/add to account for multiple failing reruns.
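
One way to look at the alignment file is to list the total length of every sequence; in a valid alignment they should all be equal. A minimal sketch, assuming a plain FASTA alignment such as ORTHOGROUPS_SINGLE_GENE.NT.aln; `fasta_lengths` is a hypothetical helper and not part of the workflow.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its total sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            # Records that repeat a name are added together, so repeated
            # names show up as unexpectedly long sequences.
            lengths.setdefault(name, 0)
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy input (made up for illustration): sp_A appears twice.
example = [">sp_A", "ATGC", "ATGC", ">sp_B", "ATGC", ">sp_A", "ATGC"]
print(fasta_lengths(example))  # {'sp_A': 12, 'sp_B': 4}
```

For a real file, passing an open file handle instead of the list works the same way, since the function only iterates over lines.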

By the way, I have run coleopteran genomes in the past, including Tribolium castaneum, and can confirm that the genome (at least the NCBI genome and predicted proteins) should not give us any problems.

StepanSaenko commented 1 year ago

Could you please share the initial *.txt files? My email is saenkos@uni-greifswald.de

jeffersonfparil commented 1 year ago

I'll send the config files to you as soon as I get access to the VM I ran it on. But for now, I'm running the workflow on one of my VMs using the config files you've previously sent.

jeffersonfparil commented 1 year ago

I think I found the issue with plotting and have committed the fix, although I'm still running the entire workflow from the beginning to validate that it works. The issue is with the sequence names in the CDS of Tribolium castaneum: some names are duplicated, so we get two sequences where we expect one. This prevents IQ-TREE from building a tree because the alignments across species do not match. Also note that I have changed params.config to de-hardcode the PantherHMM classifications text file, i.e. I've added panther_hmm_classifications_location = 'http://data.pantherdb.org/ftp/hmm_classifications/17.0/PANTHER17.0_HMM_classifications' to make things simpler when changing Panther versions. I'll let you know if the fix succeeds, then I'll send you the config files I used.
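
A quick way to check an input CDS file for this kind of problem is to count how often each sequence name occurs. A minimal sketch, assuming the name is the first whitespace-delimited token of each FASTA header (the workflow's own naming rules may differ); `duplicated_names` is a hypothetical helper and is not the fix that was committed.

```python
from collections import Counter


def duplicated_names(lines):
    """Return {name: count} for FASTA record names that occur more than once."""
    names = (
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return {name: n for name, n in Counter(names).items() if n > 1}


# Toy input (made up for illustration): gene1 appears twice.
example = [">gene1 first copy", "ATG", ">gene2", "ATG", ">gene1 second copy", "ATG"]
print(duplicated_names(example))  # {'gene1': 2}
```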

jeffersonfparil commented 1 year ago

Validated the fix on my end (32-core machine with 120 GB RAM which ran for ~5 hours and 14 minutes). Please see the Coleopterans branch for the config files, you may also just clone that branch. Please let me know if it worked for you too.

Here's the summary figure I got: Test_fig_coleo

StepanSaenko commented 1 year ago

This is wonderful, thank you very much. And you did not have any Java out-of-memory error?

jeffersonfparil commented 1 year ago

I had no memory issues. Maybe the shared computing/submission system you have has some quirks with how memory is managed with java. I hope it'll be smooth this time around. Good luck with your analyses. Very interesting patterns of contraction/expansion of gene families!

gdmdxl commented 7 months ago

I have a similar error. My result in “Orthogroups.tsv” differs from the TEST one.

my result:

OG0002194 Dryobates_pubescens|XP_054020272.1 transmembrane protein 245 isoform X1 [Dryobates pubescens], Dryobates_pubescens|XP_054020273.1 transmembrane protein 245 isoform X2 [Dryobates pubescens], Dryobates_pubescens|XP_054020274.1 transmembrane protein 245 isoform X3 [Dryobates pubescens] Indicator_indicator|XP_054237876.1 transmembrane protein 245 [Indicator indicator] Melanerpes_aurifrons|TMEM245_rna-XM_015282350.2.3676, Melanerpes_aurifrons|TMEM245_rna-XM_015282351.2.3676, Melanerpes_aurifrons|TMEM245_rna-XM_015282352.2.3676 Upupaepops|NWU95202.1 TM245 protein partial [Upupa epops]

TEST:

OG0017327 Arabidopsis_arenosa|CAE6190620.1 Arabidopsis_lyrata|XP_020875211.1 Arabidopsis_suecica|KAG7547138.1, Arabidopsis_suecica|KAG7620870.1 Arabidopsis_thaliana|NP_567536.1

The headers in my result contain the protein names, but the source file format is the same.

my:

XP_009894245.2 pyroglutamylated RF-amide peptide receptor [Dryobates pubescens]
MRSLNITPEQFAQLLRDNNVTREQFIALYGLQPLVYIPELPGRTKVAFVLICVLIFVLALFGNCLVLYVVTRSKAMRTVT
NIFICSLALSDLLIAFFCVPFTMLQNISSNWLGGAFACKMVPFVQSTAIVTEILTMTCIAVERHQGIVHPLKMKWQYTNK
RAFTMLGIVWLLALIVGSPMWHVQRLEVKYDFLYEKVYVCCLEEWASPIYQKIYTTFILVILFLLPLMLMLFLYTKIGYE
LWIKKRVGDASVLQTIHGSEMSKISRKKKRAIVMMVTVVFLFAVCWAPFHVIHMMIEYSNFEKEYDDVTVKMIFAIVQII
GFFNSICNPIVYAFMNENFKKNFLSAICFCIVKENSSPARHLGNLGITLRRQKAASQRDPVDSDEGRREAFSDGNIEVKF
CDQPSSKRHLKRHLALFSSELTVHSALGNGQ

TEST:

KAG7527760.1 hypothetical protein ISN44_Un269g000010, partial [Arabidopsis suecica]
IEDFVKEYHEAKDTPKDQNLKRPRQSNEEEPRSSKGKINVIIGGSKLCRDTINAIKKHRRNVLFKANLGEEMDFQGTSIS
FDEEETCHLERPHDDALVITLDVANFEVSRILVDTGSSVDLIFLGTLERMGISRADIVGPPTPLVAFTSESAMSLGTIKL
PVLAKNVSKIVDFVVFDKPAAYNIILGTPWIYQMKAVPSTYHQCIKFPTPSGVGTIRGSQEASRT

I would like to know whether this problem originates from OrthoFinder or compare_genomes, and how to solve it.

@jeffersonfparil