jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

sqm_longreads yields wrong mcount table #686

Open jllavin77 opened 1 year ago

jllavin77 commented 1 year ago

Dear developers,

I have run sqm_longreads on 24 ONT samples and the .out.allreads.mcount file is not correct. Every sample's reads-related columns are either empty or filled with 0 values, even though the ORF columns are correct and account for the detected ORFs in each read. Here is a sample of the table to illustrate:

[screenshot: excerpt of the malformed .out.allreads.mcount table]
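For reference, a quick way to see the problem is to load the table and check whether every reads column is all zeros or empty. Something like this (a rough sketch; adjust the file name and the "reads" substring used to pick the columns to match your own table):

```python
# Sanity check: flag read-count columns that are entirely empty or zero.
# Assumes a tab-separated mcount table; the file name is a placeholder.
import pandas as pd

mcount = pd.read_csv("PROJECT.out.allreads.mcount", sep="\t")
read_cols = [c for c in mcount.columns if "reads" in c.lower()]

for col in read_cols:
    values = pd.to_numeric(mcount[col], errors="coerce").fillna(0)
    if (values == 0).all():
        print(f"{col}: entirely empty/zero")
```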

I tried the same analysis with 1 sample and with 9 samples, and in both cases the .out.allreads.mcount file was completely correct.

The command I ran (using the latest available version of SqueezeMeta, 1.6.2) was:

sqm_longreads.pl -p PROJECT -s list.txt -f /storage/SQM/PROJECT/ --euk -t 16

Is there any problem if I analyze more than 10 samples in the same batch (when it comes to parsing each sample's results and generating the full table)?

By the way, I can't import the results into SQMtools, as I receive an error related to the results parsing... I presume it is caused by the malformed .mcount table.

I would appreciate it if you could help me fix this problem, given its urgency.

Thanks in advance & Best wishes

JL

jllavin77 commented 1 year ago

No ideas on this issue?

fpusan commented 1 year ago

Hi, sorry for the delay! The error in sqmreads2tables.py will indeed come from the malformed mcount table, but I am not sure what is causing that in the first place. Can you try running it without the --euk flag and see if the error persists?
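That would be the same command with --euk dropped, e.g. (PROJECT_noeuk is just a placeholder project name so it doesn't collide with your first run):

sqm_longreads.pl -p PROJECT_noeuk -s list.txt -f /storage/SQM/PROJECT/ -t 16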

jllavin77 commented 1 year ago

Thank you for your answer!

I will try to run it that way as soon as I get enough disk space available for the run.

The thing is, it didn't happen with the "shorter" (fewer than 10 samples) runs...

fpusan commented 1 year ago

This is indeed weird. Were the shorter runs also using the same parameters?

jllavin77 commented 1 year ago

Yep, only changed paths/filenames in the run.

jtamames commented 1 year ago

Hello! Sorry for the delay. Could you please run a "tree" command on your project directory? Sometimes the results are produced but the tables are not correctly generated; I want to know if this is the case. Best,

jllavin77 commented 1 year ago

Hello everybody,

Sorry for the delay, but I was following your suggestions before replying.

1) I ran the project without the --euk flag, as suggested, and the results table is still malformed.

2) Find the output of the tree command attached (I misread your comment the other day): Tree_SQM.txt

I hope you can spot the problem, because I have already run the project twice (8 full days each run) and I am getting a little concerned about this issue.

Thanks in advance

JL

jllavin77 commented 1 year ago

Any insight after looking at the results' directory contents?

I've run the test_install.pl script and all checks pass. I've also tried running it on 1 and 5 samples only, and this time the .mcount file is incorrect in both cases too... The only thing I can think of is that I updated SqueezeMeta to version 1.6 before running this batch of samples... Has anyone else experienced this issue?

jllavin77 commented 1 year ago

I finally found out what the problem is. To sum up quickly: every time sqm_longreads stops in the middle of a run for any reason (e.g. a power outage), restarting the run resets the *.out.allreads file to 0 KB, losing all the information about each sample's reads that had been stored there up to that point. That is why the reads are lost and the corresponding columns appear empty for those samples.

Another thing I tried was rerunning the script on the previous results to recover the hits info, but that doesn't work either. It recognizes the previous results and gives you the option to overwrite or keep them, but if you keep them, the reads-related info is still missing.
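In other words, the restart seems to reopen the file in write/truncate mode rather than append mode. A minimal Python illustration of the difference (the pipeline itself is Perl; the file name is a placeholder):

```python
# "w" truncates: a restart that opens the results file like this wipes
# whatever a previous, interrupted run had already stored there.
with open("PROJECT.out.allreads", "w") as fh:
    fh.write("hits from the restarted run only\n")

# "a" appends: reopening like this would preserve the earlier hits.
with open("PROJECT.out.allreads", "a") as fh:
    fh.write("new hits added after the previous ones\n")
```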

Is this something that you can fix or should we (the users) accept that sqm_longreads is not a script that can benefit from your awesome "restart" feature?

Thanks in advance & Best wishes

JL

jtamames commented 1 year ago

Thanks for the insight! I guess we could read the allreads file in case it exists and has content, record the last processed query sequence, restart with the rest, and merge the results. Not immediate, but it could be done. That said, Diamond, which generates that file, often takes a long time to store its first results in it; I am not sure whether that is customizable, or what the impact on performance would be. I am putting this on my list of things to look at. Best, J
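For the record, the resume logic would be something along these lines (a rough sketch, not the actual SqueezeMeta code; it assumes a tab-separated allreads file with the query ID in the first column):

```python
# Sketch: collect the query IDs already present in a partial allreads file,
# skip them on restart, and append the new hits to the same file.
import os

allreads = "PROJECT.out.allreads"  # placeholder name

done = set()
if os.path.exists(allreads) and os.path.getsize(allreads) > 0:
    with open(allreads) as fh:
        for line in fh:
            done.add(line.split("\t", 1)[0])  # query ID assumed in column 1

def remaining(all_queries):
    """Yield only the queries with no hits recorded yet."""
    for q in all_queries:
        if q not in done:
            yield q

# Caveat: the last query in the file may be incomplete if the run died
# mid-write (Diamond buffers its output), so a safer variant would drop
# that query's hits and reprocess it as well.
# New hits must then be appended ("a"), never rewritten ("w"):
# with open(allreads, "a") as out: ...
```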