njaupan / ecc_finder

a tool to detect eccDNA using Illumina and ONT sequencing
GNU General Public License v3.0

Missing output files #6

Open nemitheasura opened 2 years ago

nemitheasura commented 2 years ago

Hello,

I ran ecc_finder on my Illumina sequencing data (in a conda env, as suggested on the manual site). The pipeline produced two output folders, align_files and peak_files, plus an ecc.sr.site.bed file. I have noticed that the .png, .fasta and .csv files are missing.

I have also noticed that the test data produced an empty .csv and an empty .png (there were no data within the plot; the plotting area was empty). Since no reference genome was provided with the test data, I used TAIR10.

Here is my console output. I tried to rerun the command, and it seems to get stuck at the split-read detection step.

Tue Feb 1 11:07:26 2022 --- CMD: python ecc_finder.py map-sr reference.fasta.mmidx FKDL202584971-1a-N706-AK425_H7W3WCCX2_L1_1.fq FKDL202584971-1a-N706-AK425_H7W3WCCX2_L1_2.fq -r reference.fasta --aligner minimap2 -o ecc_finder/ecc_finder_minimap2
Tue Feb 1 11:07:26 2022 --- INFO: Remove adapter
Tue Feb 1 11:07:26 2022 --- INFO: Retaining pre-existing file: ecc_finder/ecc_finder_minimap2/align_files/ecc.sr.html
Tue Feb 1 11:07:26 2022 --- INFO: Align the query raw read to the reference
Tue Feb 1 11:07:26 2022 --- INFO: Retaining pre-existing file: ecc_finder/ecc_finder_minimap2/align_files/ecc.sr.sam
Tue Feb 1 11:07:26 2022 --- INFO: Sort the alignments
Tue Feb 1 11:07:26 2022 --- INFO: Retaining pre-existing file: ecc_finder/ecc_finder_minimap2/ecc.sr.bam
Tue Feb 1 11:07:26 2022 --- INFO: Check read pair and read pair orientation
Tue Feb 1 11:07:26 2022 --- INFO: Retaining pre-existing file: ecc_finder/ecc_finder_minimap2/ecc.sr.bam.bed
Tue Feb 1 11:07:26 2022 --- INFO: Detecting sites of genomic enrichment
Tue Feb 1 11:07:26 2022 --- INFO: Retaining pre-existing file: ecc_finder/ecc_finder_minimap2/ecc.sr.site.bed
Tue Feb 1 11:07:26 2022 --- INFO: Detect split reads
(ecc_finder) nemi@theasura-mint:~/ecc_finder$

Could you provide any info/support on that matter?

Best, Nemi

njaupan commented 2 years ago

Dear Nemi,

It is strange that the test data gives empty output; something has definitely gone wrong, because everything should run successfully, as shown in the YouTube videos. Can you attach a screenshot of your terminal or your script? Best regards


nemitheasura commented 2 years ago

Hi, I am currently rerunning the pipeline with fewer workers. This time the error has not occurred yet - it is at the minimap2 stage, so there is nothing in the console worth looking at. I cannot provide a screenshot from the previous run, because it crashed the server, which then had to be restarted. RIP to my byobu session.

This was the culprit:

python ecc_finder.py map-sr reference.fasta.mmidx FKDL202584971-1a-N706-AK425_H7W3WCCX2_L1_1.fq FKDL202584971-1a-N706-AK425_H7W3WCCX2_L1_2.fq -r reference.fasta --aligner minimap2 -o ecc_finder/ecc_finder_minimap2 -t 16

The input fastqs are rather large, circa 30 GB each (yes, eccDNA-enriched - a full-capacity sequencing run).

I have noticed that it quickly occupies the entire RAM (I have >200 GB at my disposal, which I think is a fairly decent amount), and it crashes at the "Detect split reads" stage. The funny thing is that it does not throw an error. I only figured out that something must be wrong because the manual describes a *.csv output file. Also, when I ran the analysis on the test data, it was clear that some outputs were really missing.

We enlarged our /tmp folder and decreased the number of workers (initially I ran this with 16 cores, but this time only 5). This is why the whole analysis runs really slowly.

I will keep you updated if this happens again.

Maybe it would be worth considering whether the tool could be optimized to work on larger data?

Best, Nemi

njaupan commented 2 years ago

Thank you for the details. For large data, the minimap2 alignment itself will not be a problem. However, given the large RAM usage you describe, I realized the issue comes from bedtools, which is used to detect split and discordant reads.

I have compared the performance of samblaster and bedtools for this step: samblaster is consistently faster and takes less memory, but detects fewer circular reads. Given your high-coverage data, circular reads should have significantly high copy numbers, so samblaster should be the better choice. It can be added in a new version.

One cause of empty files is that when you rerun the same script, it continues to the next step whenever it finds a file with the same prefix name. To skip the alignment step and reuse the successfully produced BAM file, I would suggest you either 1) delete the peak_files folder and any empty, failed output files, or 2) set a different prefix name with "-x" and rename the BAM file, so you do not need to rerun the alignment.
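A minimal sketch of option 1), assuming the output layout shown in the log earlier in this thread; the helper function itself is hypothetical, not part of ecc_finder:

```python
import shutil
from pathlib import Path

def clean_stale_outputs(outdir):
    """Remove the peak_files folder and any zero-byte files so that a
    rerun regenerates them instead of 'retaining pre-existing file'."""
    out = Path(outdir)
    removed = []
    peak = out / "peak_files"
    if peak.is_dir():
        shutil.rmtree(peak)  # force the peak-calling step to rerun
        removed.append(str(peak))
    for f in out.rglob("*"):
        if f.is_file() and f.stat().st_size == 0:  # empty, failed outputs
            f.unlink()
            removed.append(str(f))
    return removed

# e.g. clean_stale_outputs("ecc_finder/ecc_finder_minimap2")
```

This only deletes zero-byte files and the peak_files folder, so the successfully produced BAM and BED files are kept for the rerun.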

Best wishes


nemitheasura commented 2 years ago

Thank you for your reply.

I ask because I have found the following workaround: if I split my read files into chunks and run each chunk separately, I get the correct output - all of the files, with the expected extensions. Can I run it in chunks and merge the outputs? Would the final output be legitimate?

Looking forward to hearing from you. Have a nice day!

Best, Nemi

njaupan commented 2 years ago

Hi Nemi,

I haven't tried this approach, but it should work for most of the significant circular DNAs.

I would suggest, after merging the per-chunk CSV files into one, checking the top 5 locations with the highest numbers of supporting reads. Circular DNA arising from repetitive sequences such as ribosomal DNA, centromeres or telomeres can serve as controls; the remaining locations will be of final interest.
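This merge-and-inspect step could be sketched as below. The column name `num_reads` for the supporting-read count is an assumption for illustration, not the actual ecc_finder CSV header; adjust it to the real column name:

```python
import csv

def merge_chunks(csv_paths, reads_col="num_reads"):
    """Concatenate per-chunk CSVs (each with a header row) and return the
    rows sorted by supporting-read count, highest first."""
    rows = []
    for path in csv_paths:
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    rows.sort(key=lambda r: int(r[reads_col]), reverse=True)
    return rows

# top5 = merge_chunks(["chunk_1.csv", "chunk_2.csv"])[:5]
```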

To balance the trade-off between abundance and accuracy in the merged CSV file, you could try plotting the distribution of supporting-read numbers and, based on that, set a sensible threshold to remove locations with few supporting reads.
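The thresholding idea might look like the following sketch; the `num_reads` column name is again an assumption for illustration, and the median fallback is just one possible default where no cutoff has been read off a plot:

```python
from statistics import median

def filter_low_support(rows, reads_col="num_reads", min_reads=None):
    """Drop locations whose supporting-read count falls below a cutoff.
    If no cutoff is given, use the median of the distribution."""
    counts = [int(r[reads_col]) for r in rows]
    cutoff = min_reads if min_reads is not None else median(counts)
    return [r for r in rows if int(r[reads_col]) >= cutoff]
```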

Let me know whether it helps.

Best regards, Panpan

nemitheasura commented 2 years ago

Hi @njaupan, before trying to run your program on the split files, I launched it on another server - with a 1 TB temp folder and 1008 GB of RAM. Everything ran smoothly until it reached the "Detect split reads" step. Then it crashed. The output from my console is shown below:

Sun Feb  6 19:49:18 2022 --- INFO: Detect split reads
Traceback (most recent call last):
  File "map-sr.py", line 380, in <module>
    main()
  File "map-sr.py", line 353, in main
    run_split(file_prefix,align_path,peak_path, num_threads, overwrite_files)
  File "map-sr.py", line 128, in run_split
    d =BedTool.merge(x,c = '5,6', o = 'collapse,collapse')
  File "/home/nemi/.conda/envs/ecc_finder/lib/python3.8/site-packages/pybedtools/bedtool.py", line 917, in decorated
    result = method(self, *args, **kwargs)
  File "/home/nemi/.conda/envs/ecc_finder/lib/python3.8/site-packages/pybedtools/bedtool.py", line 396, in wrapped
    stream = call_bedtools(
  File "/home/nemi/.conda/envs/ecc_finder/lib/python3.8/site-packages/pybedtools/helpers.py", line 460, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: 
Command was:

        bedtools merge -o collapse,collapse -i /tmp/pybedtools._dwf99vj.tmp -c 5,6

Error message was:
Error: line number 279715765 of file /tmp/pybedtools._dwf99vj.tmp has 4 fields, but 6 were expected.

Do you have any idea how to debug this? I have enough disk space; my quota is >130 TB, so it is unlikely that the output was truncated due to lack of space.
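For what it's worth, one way to narrow this down is a quick scan of the offending temporary file for lines that do not have the six tab-separated fields `bedtools merge -c 5,6` expects. This is a hypothetical diagnostic, not part of ecc_finder:

```python
def find_bad_lines(path, expected_fields=6, sep="\t", limit=10):
    """Return (line_number, field_count) pairs for the first `limit`
    lines whose field count differs from `expected_fields`."""
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            n = len(line.rstrip("\n").split(sep))
            if n != expected_fields:
                bad.append((lineno, n))
                if len(bad) >= limit:
                    break
    return bad

# e.g. find_bad_lines("/tmp/pybedtools._dwf99vj.tmp")
```

Running it on the /tmp/pybedtools.*.tmp file named in the traceback should point at the malformed lines, such as the 4-field line the bedtools error message reports.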

I will be grateful for the feedback.

Have a nice evening!

Best, Nemi

njaupan commented 2 years ago

Hi Nemi,

I fully agree that it is not a RAM problem; rather, the intermediate file has an empty 6th column on some lines, so bedtools merge did not work. There are places in the script to dig into, but having the data in question would make it easier. Could you please send me a random subset of your raw reads (njaupanpan@gmail.com) so I can take a look?

Best, Panpan

inud3mon commented 1 year ago

Hi Panpan,

I actually have the same problem as above - the code will not even run with the provided Arabidopsis test samples. It always crashes at the same point:

bedtools merge -o collapse,collapse -i /tmp/.........-c 5,6

How can I fix this issue please?

Hi @nemitheasura, I was also wondering whether you were able to get past this issue?

Thanks,

Jenn