raphael-group / multibreak-sv

MultiBreak-SV identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

error message: Too many open files #9

Open gn01786955 opened 7 years ago

gn01786955 commented 7 years ago

@annaritz I made the m5 files by aligning my raw reads to my reference genome.

tig01.m5 4851123353
tig105.m5 199228139
venter-chr17.m5 91077781

A line from my tig01.m5 file:

m160310_101208_42180_c100906132550000001823204104301690_s1_p0/10/8333_8851 518 2 518 + chr17 4800114 4117601 4118110 - -2043 476 7 33 26 254 AAAGCTTTAT-GATG

A line from my tig105.m5 file:

m160310_101208_42180_c100906132550000001823204104301690_s1_p0/11/0_7102 7102 136 7095 + chr18 141020 106882 113698 + -25215 6242 115 602 459 254 TCCGAAAC

I ran M5toMBSV on my tig105.m5 file and got an error, but the other files ran successfully.

SPLITTING CLUSTERS FILE FOR MBSV...
0 clusters and 0 fragments
Final Iteration: 0

DONE. Use /bip7_disk/yuyu105/linkFILE/out-MBSVinputs/assignments.txt, /bip7_disk/yuyu105/linkFILE/out-MBSVinputs/experiments.txt, and the independent subproblem cluster files in /bip7_disk/yuyu105/linkFILE/out-MBSVinputs/cluster-subproblems/ to run MultiBreak-SV.

Exception in thread "main" java.io.FileNotFoundException: /bip7_disk/yuyu105/linkFILE/out-RunGASV/binned-esps/intrachrom-longread_27172_0.0-1.0 (Too many open files)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at java.io.FileReader.<init>(FileReader.java:58)
        at gasv.main.ReadESP.<init>(Unknown Source)
        at gasv.main.ReadInput.createReadFiles(Unknown Source)
        at gasv.main.ReadInput.readWindowFromFiles(Unknown Source)
        at gasv.main.ClusterESP.clusterESP(Unknown Source)
        at gasv.main.GASVMain.main(Unknown Source)

My questions are:
(1) Is my tig01.m5 file too large?
(2) I split my tig01.m5 file into two files. Will that affect the SV results?
(3) How can I tell which calls are insertions or inversions in the results for the tig105.m5 file?

My results for the tig105.m5 file:

c11     0.0000  0.9900  D       18      18229-22562     18      20244-24577     1       longread_4767
c15     0.0000  0.7900  I+      18      63022-63326     18      66563-66867     1       longread_1505_0.0-1.0
c17     0.0000  0.0000  D       18      70672-71317     18      75582-76227     1       longread_1811_0.0-1.0
c19     0.0000  0.3600  I+      18      112626-112961   18      116385-116720   1       longread_786_0.0-1.0
c18     0.0000  0.9900  I-      18      108323-108804   18      114093-114574   1       longread_951_0.0-1.0

Thank you for your help

annaritz commented 7 years ago

Hi @gn01786955, I see the "too many open files" error, but there seems to be an issue before that (which may be the root cause). After GASV clustering, the program breaks the clusters into independent subproblems which can be run in parallel. However, the lines below indicate that something is amiss:

SPLITTING CLUSTERS FILE FOR MBSV...
0 clusters and 0 fragments
Final Iteration: 0

There are 0 clusters listed. I see you have a .clusters file listed (though you call it tig105.m5 - is this a mistake?), but the support of each cluster (the second column) seems to be 0. This cannot be correct, since each cluster must have at least one read in it.

You said there were errors earlier - please provide these and I will take a closer look. To answer your questions at the bottom of the issue,

(1) Is my tig01.m5 file too large? No, it should work fine (though it may take a while to run).

(2) I split my tig01.m5 file into two files. Will that affect the SV results? Yes, since SVs take into account the number of reads that support them. We need to know the number of reads that support each SV, so you must provide the whole file. MBSV relies on breaking the data into independent subproblems, as indicated above, so this is where you will be able to see the performance improvement.

(3) How can I tell which calls are insertions or inversions in the results for the tig105.m5 file? The following information can be found in the GASV User Guide:

  1. D = Deletion
  2. I = Inversion; IR = Reciprocal Inversion (both ++ and --); I+ = Inversion (++ side only); I- = Inversion (-- side only)
  3. V = Divergent
  4. T = Translocation; TR = Reciprocal Translocation; TN = Nonreciprocal

I haven't extensively looked for insertions: only deletions, inversions, and translocations.
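
As a rough illustration of the codes listed above, the sketch below labels one row of a results listing with a readable SV type. It assumes whitespace-separated rows with the type code in the fourth column, as in the tig105 results posted earlier in this issue. `SV_TYPES` and `describe_cluster` are hypothetical names for illustration and are not part of GASV or MultiBreak-SV.

```python
# Type codes as listed in the GASV User Guide excerpt quoted in this thread.
SV_TYPES = {
    "D": "Deletion",
    "I": "Inversion",
    "IR": "Reciprocal Inversion (both ++ and --)",
    "I+": "Inversion (++ side only)",
    "I-": "Inversion (-- side only)",
    "V": "Divergent",
    "T": "Translocation",
    "TR": "Reciprocal Translocation",
    "TN": "Nonreciprocal Translocation",
}


def describe_cluster(line):
    """Return (cluster_id, readable_type) for one whitespace-separated row.

    Assumption: the type code is the fourth column, as in the rows
    posted in this issue.
    """
    fields = line.split()
    code = fields[3]
    return fields[0], SV_TYPES.get(code, "Unknown code: " + code)


if __name__ == "__main__":
    # Two rows copied from the results posted in this issue.
    rows = [
        "c11 0.0000 0.9900 D 18 18229-22562 18 20244-24577 1 longread_4767",
        "c15 0.0000 0.7900 I+ 18 63022-63326 18 66563-66867 1 longread_1505_0.0-1.0",
    ]
    for row in rows:
        print(*describe_cluster(row))
```

None of the codes in this list denotes an insertion, so a row would only ever be labeled as a deletion, inversion, divergent pair, or translocation under this mapping.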

zijuexiansheng commented 7 years ago

@annaritz Why do I+ and I- both have StartLocRange and EndLocRange reported? My understanding is that for I+ only the StartLocRange should be meaningful, and for I- only the EndLocRange should be meaningful. Am I wrong about that?

zijuexiansheng commented 7 years ago

I just got the "Too many open files" exception too. There is only one reference genome, whose header is chr1. The error is:

RUNNING GASV
writing to output directory mbsv-RunGASV/
java -Xms2g -Xmx5g -jar /afs/nd.edu/user34/szhu3/loonlocal/openbiosrc/GASV/gasv/bin/GASV.jar --cluster --batch --maximal --output regions --nohead --minClusterSize 1 --outputdir mbsv-RunGASV/ --verbose mbsv-RunGASV/gasv.in
Using window size of 43320
ClusterESP: processing chr 1, chr1
Exception in thread "main" java.io.FileNotFoundException: mbsv-RunGASV/binned-esps/intrachrom-longread_26025_0.0-1.0 (Too many open files)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at java.io.FileReader.<init>(FileReader.java:58)
    at gasv.main.ReadESP.<init>(Unknown Source)
    at gasv.main.ReadInput.createReadFiles(Unknown Source)
    at gasv.main.ReadInput.readWindowFromFiles(Unknown Source)
    at gasv.main.ClusterESP.clusterESP(Unknown Source)
    at gasv.main.GASVMain.main(Unknown Source)
Final file is mbsv-RunGASV/gasv.in.clusters
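
Both reports in this thread fail while opening a file under binned-esps/, and "Too many open files" is the message the operating system gives when a process reaches its file descriptor limit. A minimal sketch for comparing that limit with the number of files in the binned-esps directory follows. It assumes a Unix-like system (Python's `resource` module is not available on Windows) and assumes, from the stack trace rather than from the GASV source, that GASV may hold many of those files open at once. The helper names `count_files` and `fits_under_limit` and the headroom value are hypothetical.

```python
import os
import resource
import sys


def count_files(directory):
    """Count regular files directly inside `directory`."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())


def fits_under_limit(n_files, soft_limit, headroom=64):
    """True if n_files plus some headroom fits under the soft limit.

    `headroom` is an arbitrary allowance for descriptors the process
    already uses (standard streams, jar files, and so on).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_files + headroom <= soft_limit


if __name__ == "__main__":
    # Default path taken from the log in this thread; pass another as argv[1].
    directory = sys.argv[1] if len(sys.argv) > 1 else "mbsv-RunGASV/binned-esps"
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit {soft}, hard limit {hard}")
    if os.path.isdir(directory):
        n = count_files(directory)
        print(f"{n} files in {directory}")
        if not fits_under_limit(n, soft):
            print("File count exceeds the soft limit.")
    else:
        print(f"{directory} not found; nothing to count.")
```

If the count does exceed the soft limit, raising the limit in the shell with `ulimit -n` before rerunning is one thing to try. Whether that resolves this particular failure is not confirmed anywhere in this thread.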