gmarocena / gasv

Automatically exported from code.google.com/p/gasv

Cannot process more than 24 distinct chromosomes #7

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Start with an organism with more than 24 chromosomes or an incompletely 
scaffolded genome project.
2. Run BAM_preprocessor.pl on bam file of alignments.
3. Run gasv.

What is the expected output? What do you see instead?
Success is expected.
Fails with "The inputs file(s) are either not sorted properly or have 
additional data with chromosome number > 24!"

What version of the product are you using? On what operating system?
1.5.2

Please provide any additional information below.

Original issue reported on code.google.com by dan.kort...@adelaide.edu.au on 19 Apr 2012 at 1:08

GoogleCodeExporter commented 8 years ago
Thank you for your interest in GASV.

I believe that your issue may lie in the conversion from the BAM file to the 
GASV ESP file format. Can you provide a snippet of the ESP files (*deletions, 
*inversions, etc.)? A short sample of your BAM file, using samtools view, would 
be useful as well (samtools view yourFile.bam).

Our GASV software is capable of processing more than 24 chromosomes, but all 
chromosome names must be integers (1, 2, etc.).

If your BAM file uses names of the form chrX, CHRX, or chromosomeX (or a similar 
naming convention), this conversion should have happened automatically. If your 
BAM file uses "unconventional" names, however, you need to include a conversion 
file.
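
To illustrate the naming rule described above, here is a minimal Python sketch of turning reference names into integers. It is not GASV's own conversion code: the function `chromosome_id`, the `extra` mapping (standing in for the conversion file, whose real format is not shown in this thread), and the scaffold names and numbers are all hypothetical.

```python
import re

# Hypothetical helper, not part of GASV. Accepts "12", "chr12", "CHR12",
# "chromosome12" and similar; the numeric part becomes the integer ID.
_NUMBERED = re.compile(r"^(?:chromosome|chr)?(\d+)$", re.IGNORECASE)


def chromosome_id(name, extra=None):
    """Return an integer ID for a reference sequence name.

    Names that do not follow the numbered convention must be listed in
    `extra`, a caller-supplied dict that plays the role of a conversion file.
    """
    name = name.strip()
    match = _NUMBERED.match(name)
    if match:
        return int(match.group(1))
    if extra and name in extra:
        return extra[name]
    raise ValueError(f"no integer ID known for reference name {name!r}")


# Made-up entries for an incompletely scaffolded genome.
scaffolds = {"scaffold_17": 77, "chrUn_random": 78}

print(chromosome_id("chr38"))                    # 38
print(chromosome_id("scaffold_17", scaffolds))   # 77
```

Raising an error for unknown names, rather than guessing an ID, makes a missing conversion entry visible before the downstream tool sees the data.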

Once we see what is going on we can make sure you are able to run GASV on your 
data set, and adjust our GASV code to have a more informative error message in 
this case.

Thank you again for your interest in GASV,

Suzanne

Original comment by sora...@gmail.com on 23 Apr 2012 at 3:07

GoogleCodeExporter commented 8 years ago
Thanks. Yes, I found the global in GASVMain and that resolved the issue.
I should have closed it here. Sorry.

However, I now get an error:
Encountered line HWI-ST960:64:C0999ACXX:2:1101:2088:2265        38      3024419 3024519 +       38      3025523 3025623 -
 with matching chromosomes but has coordinate 3024419.0 that is smaller than the start of the current window, which would be 3035400! File probably isn't sorted correctly!

The input was created with BAM_preprocessor.pl, though, so I'm at a loss
to understand this. Should I open a new issue for that?

Original comment by dan.kort...@adelaide.edu.au on 23 Apr 2012 at 11:15

GoogleCodeExporter commented 8 years ago
Thanks for reporting back with an update.

We can continue to resolve things in this thread (no need to open another 
issue). 

The error message suggests that the input files are not sorted. As you point 
out, sorting should happen automatically when the files are generated with 
BAM_preprocessor.pl. 

Let's first try to pinpoint where the problem is by attempting to re-sort 
your ESP files.

You can sort ESP files into GASV order with the sortESP.bash script in GASV. 

The script is at gasv/lib/sortESP.bash.

To Run: ./sortESP.bash fileToSort

Output: fileToSort.sorted

Run diff to compare the two ESP files (fileToSort and fileToSort.sorted). If 
the two files are the same, this suggests that BAM_preprocessor ran correctly 
and the problem is with GASV.
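
For anyone without diff at hand, the comparison described above can be sketched in Python. The helper `first_difference` is a hypothetical name, not something shipped with GASV; it only reports where two files first differ and makes no assumption about the ESP format.

```python
from itertools import zip_longest


def first_difference(path_a, path_b):
    """Return the 1-based number of the first line that differs, or None.

    A file that is a strict prefix of the other counts as differing at the
    first line that exists in only one of them.
    """
    with open(path_a) as a, open(path_b) as b:
        for number, (line_a, line_b) in enumerate(zip_longest(a, b), 1):
            if line_a != line_b:
                return number
    return None


# Usage (file names follow the sortESP.bash convention described above):
# line = first_difference("fileToSort", "fileToSort.sorted")
# print("identical" if line is None else f"first difference at line {line}")
```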

If the problem is with the BAM_preprocessor, I'll direct that to the 
appropriate team member. But if the sorting is correct, it would be helpful to 
have at least a snippet of your ESP file for debugging purposes.

Let me know how the sorting comparison goes.

Suzanne

Original comment by sora...@gmail.com on 24 Apr 2012 at 11:26

GoogleCodeExporter commented 8 years ago
The upshot of the sort is that the two files are in completely different order. 
Actually, this is obvious from looking at the original file:

head -n 5 PB-B_AC0999ACXX_CGATGT_L002_all.translocation
HWI-ST960:64:C0999ACXX:2:1101:18473:2246        38      3024097 3024197 -       77      96683   96783   +
HWI-ST960:64:C0999ACXX:2:1101:18966:2234        38      3020817 3020917 +       77      106957  107057  -
HWI-ST960:64:C0999ACXX:2:1101:3214:2345 38      3000531 3000631 +       77      104763  104863  -
HWI-ST960:64:C0999ACXX:2:1101:4649:2291 22      120210215       120210315       +       38      78264307        78264407        -
HWI-ST960:64:C0999ACXX:2:1101:10305:2468        38      3028514 3028614 +       77      97792   97892   +
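
The disorder in the records above can also be detected programmatically. The following is a rough Python sketch, not GASV code: it assumes the nine whitespace-separated columns seen in the sample (read name, then chromosome, start, end, and orientation for each end) and assumes that "sorted" means ordered by left chromosome, right chromosome, then left start. The exact order produced by sortESP.bash is not stated in this thread, so the sort key is a guess, and `parse_esp` and `first_unsorted` are hypothetical names.

```python
from typing import NamedTuple


class EspRecord(NamedTuple):
    name: str
    chr1: int
    start1: int
    end1: int
    strand1: str
    chr2: int
    start2: int
    end2: int
    strand2: str


def parse_esp(line):
    """Parse one nine-column ESP line into an EspRecord."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}: {line!r}")
    return EspRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     int(f[5]), int(f[6]), int(f[7]), f[8])


def sort_key(record):
    # Assumed ordering; the real GASV order may differ.
    return (record.chr1, record.chr2, record.start1)


def first_unsorted(lines):
    """Return the 1-based index of the first record that sorts before its
    predecessor under the assumed key, or None if the input is in order."""
    previous = None
    for index, line in enumerate(lines, 1):
        key = sort_key(parse_esp(line))
        if previous is not None and key < previous:
            return index
        previous = key
    return None


sample = [
    "HWI-ST960:64:C0999ACXX:2:1101:18473:2246 38 3024097 3024197 - 77 96683 96783 +",
    "HWI-ST960:64:C0999ACXX:2:1101:18966:2234 38 3020817 3020917 + 77 106957 107057 -",
    "HWI-ST960:64:C0999ACXX:2:1101:3214:2345 38 3000531 3000631 + 77 104763 104863 -",
    "HWI-ST960:64:C0999ACXX:2:1101:4649:2291 22 120210215 120210315 + 38 78264307 78264407 -",
    "HWI-ST960:64:C0999ACXX:2:1101:10305:2468 38 3028514 3028614 + 77 97792 97892 +",
]
print(first_unsorted(sample))  # 2
```

The second record already has a smaller left start than the first on the same chromosome pair, which matches the kind of condition GASV complains about in its "smaller than the start of the current window" error.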

Original comment by dan.kort...@adelaide.edu.au on 26 Apr 2012 at 12:16

GoogleCodeExporter commented 8 years ago
So far GASV is working fine after doing the sort, so that seems to be the issue.

thanks

Original comment by dan.kort...@adelaide.edu.au on 26 Apr 2012 at 12:45

GoogleCodeExporter commented 8 years ago
Wonderful - I'm glad that has fixed things for you. 

Do let us know if you have any other questions, and thank you again for your 
interest in GASV!

Suzanne

Original comment by sora...@gmail.com on 26 Apr 2012 at 12:56

GoogleCodeExporter commented 8 years ago

Original comment by sora...@gmail.com on 30 Apr 2012 at 4:24

GoogleCodeExporter commented 8 years ago
I don't know why I missed this earlier, but it would explain the behaviour:

<snip>
Parsing BAM file ...
Generating GASV inputs...
PB-5a_AC0999ACXX_ACAGTG_L005_all
GASV input generation complete. Sorting Files...
bash: sortESP.bash: No such file or directory
bash: sortESP.bash: No such file or directory
bash: sortESP.bash: No such file or directory
bash: sortESP.bash: No such file or directory
mv: cannot stat `PB-5a_AC0999ACXX_ACAGTG_L005_all.deletion.sorted': No such file or directory
mv: cannot stat `PB-5a_AC0999ACXX_ACAGTG_L005_all.divergent.sorted': No such file or directory
mv: cannot stat `PB-5a_AC0999ACXX_ACAGTG_L005_all.inversion.sorted': No such file or directory
mv: cannot stat `PB-5a_AC0999ACXX_ACAGTG_L005_all.translocation.sorted': No such file or directory
Sorting Complete.
<snip>

BAM_preprocessor.pl and sortESP.bash are not in PATH on my system.
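
Since the log above reports "Sorting Complete." even though the sort script was never found, a quick pre-flight check can catch this before running the pipeline. This is a minimal Python sketch using the standard library's `shutil.which`; the helper name `missing_tools` is hypothetical, and the tool names are simply the ones mentioned in this thread.

```python
import shutil


def missing_tools(names, path=None):
    """Return the entries of `names` that are not executable on the search path.

    `path` is an optional PATH-style string; when it is None, the PATH
    environment variable of the current process is used.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


missing = missing_tools(["BAM_preprocessor.pl", "sortESP.bash"])
if missing:
    print("not found on PATH:", ", ".join(missing))
```

If anything is reported, adding the directory that contains the scripts to PATH before running the preprocessor should let the sorting step find them.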

thanks

Original comment by dan.kort...@adelaide.edu.au on 3 May 2012 at 3:31