novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0
110 stars 31 forks

Empty .csv files #50

Closed Wardale24 closed 4 years ago

Wardale24 commented 4 years ago

When running TSV_to_Variants_Freq.py3 with a 100 GB .tsv file, the process is killed after about an hour, with over 200 GB of available memory remaining. I am fairly sure it is a technical issue (i.e. RAM or other hardware limits), as I am not running it on a server but on a workbench. The created .csv files are empty.

However, I wanted to open the discussion in case of future occurrences and to possibly solve other related issues.

enovoa commented 4 years ago

Hi @Wardale24 - Can you please provide further details of the commands you ran, as well as the first lines of the input file that you are feeding to the script? What error message do you receive? Thanks

enovoa commented 4 years ago

This is similar to an issue reported previously: https://github.com/enovoa/EpiNano/issues/43 -- I have followed up on the other GitHub issue as well, to see whether the solution provided there solved the problem. Both issues are most likely related to resource/memory requirements.

Wardale24 commented 4 years ago

Yes, it clearly looks like the same issue; I apologize for not seeing the closed discussion earlier.

I ran: python Epinano/scripts/TSV_to_Variants_Freq.py3 -f sample.tsv -t 10

The header of the file is

Read-Name Flag MAPQ CHROM READ-POS0 READ-BASE READ-QUAL REF-POS1 REF-BASE CIGAR-OP

2cf47d4d-aa9b-4a6b-9818-2a746b59825e 16 60 1 0 G + 3583 A S
2cf47d4d-aa9b-4a6b-9818-2a746b59825e 16 60 1 1 G . 3584 C S
2cf47d4d-aa9b-4a6b-9818-2a746b59825e 16 60 1 2 G + 3585 C S

I will attempt what Huanle suggested and split my .tsv file and run them separately. I will update as soon as I find out.
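When splitting the .tsv this way, cutting at an arbitrary line count risks splitting one read's records across two chunks. A minimal sketch of a read-aware split, assuming the read name is the first tab-separated column (function and parameter names are hypothetical, not part of EpiNano):

```python
# Hypothetical helper: group lines of an EpiNano per-read TSV into chunks,
# never splitting one read's lines across two chunks.
def split_tsv_by_read(lines, max_lines_per_chunk):
    chunks, current, current_read = [], [], None
    for line in lines:
        read = line.split("\t", 1)[0]  # read name is the first column
        # only start a new chunk at a read boundary
        if len(current) >= max_lines_per_chunk and read != current_read:
            chunks.append(current)
            current = []
        current.append(line)
        current_read = read
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file and fed to TSV_to_Variants_Freq.py3 separately.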

Thanks again for the response

enovoa commented 4 years ago

Sure - I would also recommend trying the solution that worked in #43: analyzing only those genomic regions with a minimal coverage (regions with low coverage would anyway be discarded at later stages of the analysis).
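That coverage filter can be sketched with the standard library, assuming the per-site column layout Ref,pos,base,cov,... used by EpiNano's per-site output; the function name and the threshold value are hypothetical:

```python
import csv
import io

# Sketch: keep only per-site rows whose coverage meets a minimum threshold.
# Assumes a "cov" column exists in the CSV header.
def filter_min_coverage(csv_text, min_cov=5.0):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cov_idx = header.index("cov")   # locate the coverage column
    kept = [header]
    for row in reader:
        if float(row[cov_idx]) >= min_cov:
            kept.append(row)
    return kept
```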

Wardale24 commented 4 years ago

I tried running it with half the file and still had issues. However, I took a single read (around 800 lines of the .tsv file) and tested it just to confirm it was a hardware issue. It worked this time, which is great. However, I noticed that the output 5mer .csv file is over double the size of the input .tsv file, which would explain why my process was killed.

Thank you for taking the time to answer my questions; I am very grateful. I believe I can close the issue.

Huanle commented 4 years ago

Hi @Wardale24 , Sorry for my late response. The 5mer.csv file should not be that huge; maybe you are referring to the intermediate tmp files. Anyway, I will make some changes enabling piping with the java program, which means you will not need to produce the TSV file separately, and I will replace the pandas dataframe with its dask counterpart, which saves RAM.
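Huanle plans to swap pandas for its dask counterpart; a comparable way to cap RAM with plain pandas is chunked reading, which processes the TSV a slice at a time instead of loading it whole. A minimal sketch (the REF-POS1 column name is taken from the TSV header quoted earlier; the function name and aggregation are hypothetical):

```python
import io
import pandas as pd

# Sketch: accumulate per-position counts from a large TSV without holding
# the whole table in memory, by iterating over fixed-size chunks.
def per_position_coverage(tsv_text, chunksize=2):
    counts = {}
    for chunk in pd.read_csv(io.StringIO(tsv_text), sep="\t",
                             chunksize=chunksize):
        # value_counts on each chunk, merged into a running total
        for pos, n in chunk["REF-POS1"].value_counts().items():
            counts[pos] = counts.get(pos, 0) + n
    return counts
```

With a real file, the io.StringIO wrapper would be replaced by the file path; the chunksize would be much larger (e.g. millions of rows).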

Wardale24 commented 4 years ago

Hey @Huanle ,

Yes, you're correct; it is the temporary file. I wanted to reopen the issue and give full details, because the process has been killed again.

I separated my .tsv file into 4 quarters and then analyzed the first one (24 GB) using 4 threads. This time the process was killed but did not create empty .csv files: it created a "sample1.tsv.per.site.var.csv" that is 840 MB and a temporary 5mer file that is 7.5 GB.

AW

Huanle commented 4 years ago

Thanks @Wardale24 for reporting this. I will release the newer version very soon.

Wardale24 commented 4 years ago

Hello @Huanle Just a quick update. I ran it overnight with the first 150M lines (a 10 GB .tsv file) with no extra threads, and the process was not killed. It did, however, create another massive .csv file (not a tmp file): the sample.per.site.var.csv was 400 MB and the 5mer.csv is 1.3 GB. I am guessing it shouldn't be this way?

Huanle commented 4 years ago

Hi @Wardale24 , The output file name indicates that this is the per-site variant frequencies file (including per-position base-calling qualities). The temporary folder and the files within it are deleted by default, so your run should be fine. I will be able to confirm if you can paste the first few lines (head) of your output.

Wardale24 commented 4 years ago

Hello @Huanle

I understand that the run went correctly. My question is should the 5mer.csv file be so large in comparison to the other files? (input .tsv 10GB, per.site.var.csv 400MB, 5mer.csv 1.3GB)

Regardless, here is a head of the 5mer.csv file

Kmer,Window,Ref,Coverage,q1,q2,q3,q4,q5,mis1,mis2,mis3,mis4,mis5,ins1,ins2,ins3,ins4,ins5,del1,del2,del3,del4,del5

ACACA,11931657:11931658:11931659:11931660:11931661,47,1.0:1.0:1.0:1.0:1.0,6.00000,14.00000,9.00000,0.00000,10.00000,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
ATACA,17536166:17536167:17536168:17536169:17536170,60,12.0:12.0:12.0:12.0:12.0,21.33333,18.91667,7.40000,9.00000,6.44444,1.0,1.0,0.8333333333333334,1.0,0.75,0.0,0.0,0.0,0.08333333333333333,0.08333333333333333,0.0,0.0,0.16666666666666666,0.0,0.25

And here is a head of the per site csv file:

Ref,pos,base,cov,q_mean,q_median,q_std,mis,ins,del

60,3663,C,1.0,8.00000,8.00000,0.00000,1.0,0.0,0.0
60,3664,T,1.0,8.00000,8.00000,0.00000,1.0,0.0,0.0

Huanle commented 4 years ago

Hi @Wardale24 ,

It looks correct to me in terms of their formats. You should be able to find all positions from one of them in the other file.
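That consistency check can be sketched as a small helper: every position in the per-site file should fall inside some window of the 5mer file. The colon-separated window format is taken from the 5mer header pasted above; the function and argument names are hypothetical:

```python
# Sketch of the cross-check suggested here: verify that every per-site
# position appears in at least one 5mer window ("p1:p2:p3:p4:p5" strings).
def positions_covered(per_site_positions, kmer_windows):
    window_positions = set()
    for window in kmer_windows:
        # each window lists the five positions it spans, colon-separated
        window_positions.update(window.split(":"))
    return all(pos in window_positions for pos in per_site_positions)
```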