josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

v1.4.0 : no tmp .bout files #18

Open Proginski opened 11 months ago

Proginski commented 11 months ago

Dear genEra developers,

Describe the bug: The A. thaliana CDS I am using will not be dated. I had already succeeded in using genEra v1.4.0 with a subset of the H. sapiens CDS. Now, using the enclosed FASTA, it does not work, even when providing 500 GB of RAM for 262 GB of results. Note that I performed the same analysis (same command) with v1.2.0 and it went perfectly fine (except that it took longer, of course). I would have bet that the problem is caused by the "|" character in the middle of the CDS names, but it worked with the previous version.

To reproduce:

genEra \
-t 3702 \
-q CDS/cds_from_genomic.faa \
-b /diamonddb/NR_DB/nr \
-n 75 \
-r ncbi_lineages_2023-07-12.csv

Expected behaviour: Ages should be assigned to the genes. Instead, they are not:

gene	phylostratum	rank	taxonomic_representativeness
lcl|NC_000932.1_cds_NP_051037.1_48181	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051038.1_48226	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051039.1_48182	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051040.2_48183	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051041.1_48184	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051042.1_48185	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051043.1_48186	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051044.1_48187	Absent from the DIAMOND/MMseqs2 results	NA	NA
lcl|NC_000932.1_cds_NP_051045.1_48188	Absent from the DIAMOND/MMseqs2 results	NA	NA

Screenshots or code: Here are the last lines of the err file (16 MB of similar 'No such file or directory' lines):

awk: cannot open /store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_lcl|NC_003070.9_cds_NP_001321941.1_644.bout (No such file or directory)
rm: cannot remove '/store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_lcl|NC_003070.9_cds_NP_177334.1_10947.bout': No such file or directory
awk: cannot open /store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_lcl|NC_003070.9_cds_NP_565027.1_10948.bout (No such file or directory)
rm: cannot remove '/store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_lcl|NC_003071.7_cds_NP_001323584.1_19320.bout': No such file or directory
.................................................. 1M
.................................................. 2M
.................................................. 3M
.................................................. 4M
...
[mclIO] writing </store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_3702.mci>
.......................................
[mclIO] wrote native interchange 48227x48227 matrix with 4144755 entries to stream </store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_3702.mci>
[mclIO] wrote 48227 tab entries to stream </store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_3702.tab>
[mcxload] tab has 48227 entries
[mclIO] reading </store/EQUIPES/BIM/MEMBERS/paul.roginski/Eukaryotes/GENERA/ATHA/tmp_3702_11608/tmp_3702.mcl>
.......................................
[mclIO] read native interchange 48227x8569 matrix with 48227 entries

Session info:

Paul

josuebarrera commented 11 months ago

Dear Paul,

Thanks for reaching out! You are right: the new script for faster gene age assignment seems to mistake the "|" characters in the FASTA headers for column separators, which leads to the errors. We'll start working on a solution over the weekend, and I think it should be fairly easy to fix.
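To illustrate the failure mode described above, here is a minimal Python sketch. It is not the FASTSTEP3R code: the two-field record layout, the `split_record` helper, and the idea that the script joins fields with "|" are all assumptions made for illustration. It only shows why a gene ID containing "|" is truncated when "|" is also the column separator, whereas a tab separator leaves it intact.

```python
# Hypothetical illustration only; the real FASTSTEP3R script may work differently.

def split_record(line: str, sep: str) -> list[str]:
    """Split one text record into fields on the given separator."""
    return line.rstrip("\n").split(sep)

# Gene ID taken from the report above; the second field is an assumed payload.
gene_id = "lcl|NC_000932.1_cds_NP_051037.1_48181"
taxid = "3702"

# Same two-field record, written once with "|" and once with a tab as separator.
pipe_line = "|".join([gene_id, taxid])
tab_line = "\t".join([gene_id, taxid])

pipe_fields = split_record(pipe_line, "|")
tab_fields = split_record(tab_line, "\t")

# With "|" as separator the ID is cut at its own "|", so the first field is "lcl"
# and the record appears to have three columns instead of two.
print(pipe_fields[0], len(pipe_fields))

# With a tab separator the full ID survives as the first field.
print(tab_fields[0], len(tab_fields))
```

A truncated ID of this kind would no longer match the per-gene temporary file names, which would be consistent with the "No such file or directory" messages in the report, although the thread does not confirm the exact mechanism.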

Cheers, Josué.

josuebarrera commented 11 months ago

Dear Paul,

@RocesV just fixed the issue with FASTA headers containing "|" characters. Please download the newest version of FASTSTEP3R and let me know whether this fixes your problem.
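For anyone who cannot update right away, one possible workaround is to rewrite the FASTA headers before running the analysis. The sketch below is an assumption, not part of genEra: the helper names `sanitize_header` and `sanitize_fasta`, the "_" replacement character, and the placeholder sequence "MKV" are all hypothetical.

```python
# Hypothetical workaround sketch; not part of genEra or FASTSTEP3R.
from typing import Iterable, Iterator

def sanitize_header(header: str, replacement: str = "_") -> str:
    """Replace every "|" in a FASTA header line with a safer character."""
    return header.replace("|", replacement)

def sanitize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with headers sanitized and sequence lines unchanged."""
    for line in lines:
        yield sanitize_header(line) if line.startswith(">") else line

# Header taken from the report above; "MKV" is a placeholder sequence.
example = [">lcl|NC_000932.1_cds_NP_051037.1_48181\n", "MKV\n"]
cleaned = list(sanitize_fasta(example))
print("".join(cleaned), end="")
```

Renaming headers changes the gene IDs in the output table, so the original names would need to be mapped back afterwards, and it is worth checking that the replacement does not make two IDs identical.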

Best, Josué.

Proginski commented 11 months ago

Thanks a lot!

Paul