josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

Illegal option -- #23

Closed seveein closed 1 month ago

seveein commented 2 months ago

Hi everyone, thank you very much for this nice contribution to the community! I tested GenEra on our cluster.

There, I observed the following on stdout:

Illegal option --
Illegal option --
Illegal option --
Illegal option --
Illegal option --
Illegal option --

After this, it seems like GenEra continues with the analysis:

Sun Sep  1 18:10:52 CEST 2024

Your temporary files will be stored in tmp_4084_16952

STARTING STEP 1: SEARCHING FOR HOMOLOGS WITHIN THE DATABASE USING DIAMOND
--------------------------------------------------
Matching the query genes against themselves
--------------------------------------------------
Searching for homologs against the DIAMOND database

To Reproduce

I use the Docker-Container via singularity:

singularity run -B $WORK/ GenEra/genEra.sif genEra -q spim_proteome_geneext.fasta -t 4084 -b db/nr -d db/taxdump/ -o out_spimp/

Is this a known bug, and can I trust that GenEra will continue without further issues? Thank you very much. cheers s

josuebarrera commented 2 months ago

Dear @seveein, I've never seen this error message before. I suspect it is not directly related to genEra, but to the singularity installation or the way that you are running singularity using -B $WORK/. Do these errors appear before or after the message genEra v1.X.X (C) Max Planck Society for the Advancement of Science? That way I can know if the error happened within the genEra code or not. Best, Josué

seveein commented 2 months ago

Hi Josué, The error appears right at the beginning, before the genEra [...] prompt. So it might be a singularity-related issue. GenEra seems to run, although we observed the following:

STARTING STEP 3: ASSIGNING AGES TO YOUR QUERY GENES WITH Erassignment
--------------------------------------------------
Splitting results per query gene using 16 threads
Fatal error: cannot open file 'Usage:': No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory

[...]

Running Erassignment using 16 threads
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/true

The gene_ages.tsv file is empty afterwards. Do you have an idea how we could resolve this issue? Best, s-

josuebarrera commented 2 months ago

Dear @seveein, The container seems to be working fine, and your error messages point towards some issues related to singularity. I suspect the main culprit is using singularity run instead of singularity exec.

Could you try something akin to this command?

# Establish the working directory for your genEra run
WORKDIR=/your/working/directory
cd "$WORKDIR"

# Add any other important path(s) for singularity to find (e.g., the path to the NR database or the directory where you wish to write the output files)
export SINGULARITY_BIND="/any/other/relevant/path:/any/other/relevant/path"

# Run genEra using 'exec' and specifying the absolute path of your files
singularity exec /path/to/singularity/genera_latest.sif genEra \
-q sequences.fasta -t 4084 -b /path/to/database/nr \
-d /path/to/database/taxdump -n 16 \
-o /any/other/relevant/path/output

Please let me know if this works for you so I can update the wiki for singularity users.

Best, Josué
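A note on SINGULARITY_BIND as used in the command above: each entry has the form source:destination, where source is the host path and destination is where that directory appears inside the container. When the two differ, the paths passed to genEra need to be the container-side ones. The sketch below shows that translation for a single source:destination pair. `to_container_path` is a hypothetical helper, the example paths are made up, and the sketch assumes no bind options (such as :ro) and no comma-separated list of binds.

```shell
# Hypothetical helper: translate a host path into the path visible inside
# the container, given one "source:destination" bind specification.
# Assumption: a single bind pair without trailing options.
to_container_path() {
    bind_spec="$1"
    host_path="$2"
    src="${bind_spec%%:*}"    # text before the first colon (host side)
    dest="${bind_spec#*:}"    # text after the first colon (container side)
    case "$host_path" in
        "$src"/*) printf '%s\n' "$dest/${host_path#"$src"/}" ;;
        "$src")   printf '%s\n' "$dest" ;;
        *)        echo "not under $src: $host_path" >&2; return 1 ;;
    esac
}

# Example with made-up paths: prints /mnt/db/nr
to_container_path "/data/project:/mnt" "/data/project/db/nr"
```

A path that does not sit under the bound source directory is rejected, which is a quick way to spot files the container would not be able to see.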

seveein commented 2 months ago

Dear Josué, thank you very much already. Unfortunately, I still observe the same issues after implementing the adjustments.

Splitting results per query gene using 32 threads
Fatal error: cannot open file 'Usage:': No such file or directory
--------------------------------------------------
Running Erassignment using 32 threads
/usr/bin/false
/usr/bin/false

All user-provided paths are available to singularity. There should be also enough computational resources available.

Best, s.


edit: current singularity call:

export SINGULARITY_BIND="/path:/mnt"
singularity exec $HOME/programs/GenEra/genEra.sif genEra \
    -q /mnt/data/pep_clean.fasta \
    -t 4084 -b /mnt/TAI/db/nr \
    -d /mnt/TAI/db/taxdump/ \
    -c /mnt/TAI/out_spen//ncbi_lineages_2024-09-02.csv \
    -p /mnt/tmp_4084_7909/4084_Diamond_results.bout \
    -x /mnt/tmp_4084_7909/ \
    -o /mnt/TAI/out_spen/ \
    -n 32

josuebarrera commented 1 month ago

Dear @seveein,

Could you please send me the complete STDOUT log from the genEra run? I'd like to see which step is not working correctly in the pipeline. I see you're using the arguments -c and -p, meaning that at least step 1 and step 2 of the pipeline are running correctly. Could you also please verify that ncbi_lineages_2024-09-02.csv and 4084_Diamond_results.bout are not empty? Please send me the last 10 lines of these two files (i.e., tail ncbi_lineages_2024-09-02.csv and tail 4084_Diamond_results.bout) for me to check if step 1 and step 2 ran correctly.

Best, Josué
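The check requested above can be sketched as follows, assuming a POSIX shell. `check_intermediate` is a hypothetical helper name, not part of GenEra: it reports whether a file exists and is non-empty, and if so prints its last 10 lines.

```shell
# Hypothetical helper: confirm an intermediate file is present and non-empty,
# then show its last 10 lines for inspection.
check_intermediate() {
    file="$1"
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$file" ]; then
        echo "EMPTY_OR_MISSING: $file"
        return 1
    fi
    echo "OK: $file"
    tail -n 10 "$file"
}
```

It would be called once per file, for example `check_intermediate ncbi_lineages_2024-09-02.csv` and `check_intermediate tmp_4084_7909/4084_Diamond_results.bout`, with the paths adjusted to wherever the files actually live.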

seveein commented 1 month ago

Dear @josuebarrera, here are the complete STDOUT and the tail of ncbi_lineages_2024-09-02.csv and 4084_Diamond_results.bout.

ncbi_lineages_2024-09-02.csv:

3315561,Eukaryota,Chordata,Lepidosauria,Squamata,Viperidae,Crotalus,Crotalus helleri,,Opisthokonta,Eumetazoa,Amniota,Sauropsida,Sauria,Bifurcata,Unidentata,Episquamata,Toxicofera,,,,Bilateria,,,,Deuterostomia,Vertebrata,Gnathostomata,Teleostomi,Euteleostomi,Dipnotetrapodomorpha,Tetrapoda,,,,,,,Serpentes,,Metazoa,,cellular organisms,,,,,,,,,,,,,,,,,Crotalinae,,,,Craniata,,Crotalus helleri caliginis,,Sarcopterygii,Colubroidea,,,,
3315602,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Sungouiella,Sungouiella xylosa,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315603,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Clavispora,Clavispora paralusitaniae,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315604,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Soucietia,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315605,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Sungouiella,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315606,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Osmozyma,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315610,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Tanozyma,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315611,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Gabaldonia,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3315612,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Wilhelminamyces,,,Opisthokonta,saccharomyceta,,,,,,,,,,,CUG-Ser1 clade,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,
3316682,Eukaryota,Ascomycota,Dipodascomycetes,Dipodascales,,,,,Opisthokonta,saccharomyceta,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Fungi,,cellular organisms,Dipodascales incertae sedis,,,,,,,,,,,,,,,,,,Dikarya,,Saccharomycotina,,,,,,,,,

4084_Diamond_results.bout:

[RNAseq_work]$ tail tmp_4084_7909/4084_Diamond_results.bout
GeneExt~Sopen12g035030.1.p1     PNY24194.1      8.78e-06        65.1    45235
GeneExt~Sopen12g035030.1.p1     KQP56511.1      8.84e-06        63.9    1736321
GeneExt~Sopen12g035030.1.p1     CAE6417795.1    8.88e-06        65.1    456999
GeneExt~Sopen12g035030.1.p1     KAK2612949.1    8.92e-06        65.1    1105319
GeneExt~Sopen12g035030.1.p1     ABW09584.1      9.46e-06        64.7    298653
GeneExt~Sopen12g035030.1.p1     WP_291692835.1  9.80e-06        64.7    376
GeneExt~Sopen12g035030.1.p1     WP_269613200.1  9.90e-06        64.7    1219
GeneExt~Sopen12g035030.1.p2     PHT48895.1      7.48e-11        72.0    33114
GeneExt~Sopen12g035030.1.p3     PHT48910.1      2.76e-10        68.9    33114
GeneExt~Sopen12g035030.1.p3     PHT60957.1      1.07e-07        61.6    4072

Stdout:

Illegal option --
Illegal option --
Illegal option --
Illegal option --
Illegal option --
Illegal option --
genEra v1.4.0 (C) Max Planck Society for the Advancement of Science
Starting time of run:
Tue Sep  3 18:06:07 CEST 2024

Your temporary files will be stored in /mnt/tmp_4084_7909/tmp_4084_20086

DIAMOND OUTPUT ALREADY GENERATED. SKIPPING STEP 1

We're just going to quickly cluster the query genes against themselves for later on (step 3)

THE SPECIES-TAILORED TAXONOMIC DATABASE WAS PROVIDED BY THE USER. SKIPPING STEP 2

STARTING STEP 3: ASSIGNING AGES TO YOUR QUERY GENES WITH Erassignment
--------------------------------------------------
Splitting results per query gene using 16 threads
Fatal error: cannot open file 'Usage:': No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
sed: can't read Usage:: No such file or directory
sed: can't read [-a]: No such file or directory
sed: can't read args: No such file or directory
--------------------------------------------------
Running Erassignment using 16 threads
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/false
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/true
/usr/bin/false
/usr/bin/true
/usr/bin/false
/usr/bin/true
/usr/bin/false
/usr/bin/true
--------------------------------------------------
Running mcl to define gene families
.................................................. 1M
.................................................. 2M
.................................................. 3M
.................................................. 4M
.................................................. 5M
.................................................. 6M
.................................................. 7M
.................................................. 8M
.................................................. 9M
.................................................. 10M
.................................................. 11M
.................................................. 12M
.................................................. 13M
.................................................. 14M
.................................................. 15M
.................................................. 16M
.................................................. 17M
.................................................. 18M
.................................................. 19M
.................................................. 20M
.................................................. 21M
.................................................. 22M
.................................................. 23M
.................................................. 24M
.................................................. 25M
.................................................. 26M
.................................................. 27M
.................................................. 28M
.................................................. 29M
.................................................. 30M
.................................................. 31M
.................................................. 32M
.................................................. 33M
.................................................. 34M
.................................................. 35M
.................................................. 36M
.................................................. 37M
.................................................. 38M
.................................................. 39M
.................................................. 40M
.................................................. 41M
.................................................. 42M
.................................................. 43M
.................................................. 44M
.................................................. 45M
.................................................. 46M
.................................................. 47M
.................................................. 48M
.................................................. 49M
.................................................. 50M
.................................................. 51M
.................................................. 52M
........
[mclIO] writing </mnt/tmp_4084_7909/tmp_4084_20086/tmp_4084.mci>
.......................................
[mclIO] wrote native interchange 71751x71751 matrix with 52803377 entries to stream </mnt/tmp_4084_7909/tmp_4084_20086/tmp_4084.mci>
[mclIO] wrote 71751 tab entries to stream </mnt/tmp_4084_7909/tmp_4084_20086/tmp_4084.tab>
[mcxload] tab has 71751 entries
[mclIO] reading </mnt/tmp_4084_7909/tmp_4084_20086/tmp_4084.mcl>
.......................................
[mclIO] read native interchange 71751x16687 matrix with 71751 entries
--------------------------------------------------
Establishing the age and number of gene-family founder events
--------------------------------------------------
Step 3 finished!
The age assignment for your individual genes can be found in /mnt/TAI/out_spen//4084_gene_ages.tsv
The possible ages for the genes with a taxonomic representativeness below 30 percent can be found in /mnt/TAI/out_spen//4084_ambiguous_phylostrata.tsv
The estimation of gene family founder events can be found in /mnt/TAI/out_spen//4084_founder_events.tsv
The number of individual genes that could be assigned to each phylostratum are summarized in /mnt/TAI/out_spen//4084_gene_age_summary.tsv
The number of of gene family founder events per phylostratum are summarized in /mnt/TAI/out_spen//4084_founder_summary.tsv

genEra finished at:
Tue Sep  3 18:17:04 CEST 2024

Enjoy your results!!!

gene_age_summary.tsv


#number_of_genes        phylostratum    phylorank
0       Eukaryota       72
0       Streptophyta    71
0       Magnoliopsida   70
0       Solanales       69
0       Solanaceae      68
0       Solanum 67
0       Solanum pimpinellifolium        66
0       Embryophyta     65
0       Tracheophyta    64

Let me know whether you need anything else!

Best, s.

josuebarrera commented 1 month ago

Dear @seveein,

It seems that step 1 ran without any issues, so you can keep using the file 4084_Diamond_results.bout to avoid running that part of the pipeline again. I see two possible sources of error in the pipeline:

The first thing I noticed is that the file ncbi_lineages_2024-09-02.csv should be specified with -r instead of -c, since it is an intermediate file in step 2. Do you have a file named 4084_ncbi_lineages.csv among the output files of your initial GenEra run? That is the file you should pass to GenEra with the argument -c. If it is missing, there could be an error in step 2, although I can't imagine why it would fail. Try running GenEra using -r ncbi_lineages_2024-09-02.csv and let me know if the file 4084_ncbi_lineages.csv was generated.

The other possible source of error could be the script FASTSTEP3R. This is an R script that makes step 3 run much faster than in the initial versions of GenEra, but it also consumes a considerable amount of memory and may be the cause of your errors. To verify this, could you please run GenEra again with the following argument added: -F false. This disables fast mode for step 3, which should then be able to run without any issues. I expect GenEra to take a considerable amount of time on this step though, given that you're working with a plant genome.

I'm still puzzled about the error Illegal option -- at the beginning of your log, but it is hopefully nothing to be worried about.

Please let me know if these two things solve your issues.

Cheers, Josué

seveein commented 1 month ago

Hi Josué, thank you for your help. The file 4084_ncbi_lineages.csv was also generated before. However, I adjusted the option from '-c' to '-r' for the test run, which successfully completed Step 2.

Step 3 finished within minutes and returned the same error prompt again:

STARTING STEP 3: ASSIGNING AGES TO YOUR QUERY GENES WITH Erassignment
sed: can't read Usage: /usr/bin/which [-a] args: No such file or directory
sed: can't read Usage: /usr/bin/which [-a] args: No such file or directory
sed: can't read Usage: /usr/bin/which [-a] args: No such file or directory

The issue seems to be related to Step 3. Adjusting to -F false did not improve the situation. It happens pretty early in the execution. cheers, s.

seveein commented 1 month ago

Quick update:

I've been experimenting with the Singularity settings because it seemed that Singularity was mishandling environment variables.

Using the --cleanenv option has resolved the initial issue. The repeated

Illegal option --
Illegal option --
Illegal option --

messages are no longer appearing in the STDOUT.

Additionally, Step 3 appears to be running more stably now and isn't skipping the analysis. However, it’s still in progress, so please proceed with caution.

STDOUT:


genEra v1.4.0 (C) Max Planck Society for the Advancement of Science
Starting time of run:
Wed Sep  4 08:56:32 CEST 2024

Your temporary files will be stored in /mnt/TAI/TEMP/tmp_4084_1693

DIAMOND OUTPUT ALREADY GENERATED. SKIPPING STEP 1

We're just going to quickly cluster the query genes against themselves for later on (step 3)

THE SPECIES-TAILORED TAXONOMIC DATABASE WAS PROVIDED BY THE USER. SKIPPING STEP 2

STARTING STEP 3: ASSIGNING AGES TO YOUR QUERY GENES WITH Erassignment
--------------------------------------------------
Running Erassignment using 64 threads

troubleshooting command:

singularity run --cleanenv $HOME/programs/GenEra/genEra.sif genEra \
    -q pennellii_longest_orfs_pep_clean.fasta \
    -t 4084 -b /mnt/db/nr \
    -d /mnt/TAI/db/taxdump/ \
    -c /mnt/out_spen//4084_ncbi_lineages.csv \
    -p /mnt/tmp_4084_7909/4084_Diamond_results.bout \
    -x /mnt/TAI/TEMP/ \
    -o /mnt/TAI/out_spen/ \
    -F false \
    -n 64

cheers, s.

josuebarrera commented 1 month ago

Dear @seveein,

It seems that --cleanenv solved the issue. You can check whether step 3 is running correctly by looking inside /mnt/TAI/out_spen/ to see if a file named 4084_gene_ages.tsv is being written. If GenEra takes too much time, you can try running it again without -F false to re-enable fast mode (the final results should be the same).

Cheers, Josué
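One way to watch whether the output file is being written is to compare its size at two points in time. This is only a sketch: `file_is_growing` is a hypothetical helper, and a file that does not grow within the chosen interval is not proof of failure, since the thread does not say how often GenEra flushes its output.

```shell
# Hypothetical helper: succeed if the file's size increases within N seconds
# (default 30). Returns 2 if the file does not exist yet.
file_is_growing() {
    target="$1"
    wait_seconds="${2:-30}"
    [ -f "$target" ] || { echo "missing: $target" >&2; return 2; }
    # $(( ... )) normalises wc output, which some systems pad with spaces
    size_before=$(( $(wc -c < "$target") ))
    sleep "$wait_seconds"
    size_after=$(( $(wc -c < "$target") ))
    [ "$size_after" -gt "$size_before" ]
}
```

For the run discussed here, the call would look like `file_is_growing /mnt/TAI/out_spen/4084_gene_ages.tsv 60`, issued from a shell that can see that path.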

seveein commented 1 month ago

Dear @josuebarrera,
--cleanenv solved all the issues. Step 3 just completed successfully. Maybe this could be added to the wiki as well. I appreciate your help! Best, S.
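The thread does not identify which inherited setting caused the messages. One possibility, offered as an assumption rather than a confirmed diagnosis, is a shell function exported by the host (bash stores these in the environment under names beginning with BASH_FUNC_), for example a `which` wrapper that passes options the container's /usr/bin/which does not accept. That would be consistent with the "Usage: /usr/bin/which [-a] args" text in the errors. The sketch below lists exported functions from `env` output so they can be compared with the commands named in the error messages. `list_exported_functions` is a hypothetical helper name.

```shell
# Hypothetical helper: read `env` output on stdin and print the names of
# shell functions that bash has exported into the environment.
# Bash encodes them as BASH_FUNC_name%%=... (older builds: BASH_FUNC_name()=...).
list_exported_functions() {
    sed -n 's/^BASH_FUNC_\([^%(=]*\).*/\1/p'
}

# Inspect the host environment before starting the container.
env | list_exported_functions
```

If a name such as which appears in the list, running the container with --cleanenv, as was done in this thread, starts it from a cleaned environment so that the host definition is not carried over.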