mozack / abra2

ABRA2
MIT License
90 stars 9 forks source link

YO tag format #32

Closed ionox0 closed 4 years ago

ionox0 commented 5 years ago

Thank you for your work in developing this tool,

We are wondering about samples which have YO:Z:N/A for the YO tag. Is this expected behavior and if so would you be able to provide any detail on the reason for an N/A value?

Thank you!

ionox0 commented 5 years ago

There is also the case where the tag is YO:Z:N/A:12345, with one instance of N/A.

Does this mean that the previous contig was N/A and the previous position of the read was 12345?

mozack commented 5 years ago

Hi, the YO tag stores the original alignment information and N/A stands for Not Applicable. This means that the read was originally unaligned. If you're using bwa, when one read in a pair is unmapped it is assigned to the mate's locus. The number after N/A is set to the read's original position (i.e mate locus) and is used downstream for sorting.

ionox0 commented 5 years ago

Thank you @mozack for the information.

Now that I understand the reason behind the N/A for the YO tag, it seems there may be an issue with the YO tag parsing on this line:

https://github.com/mozack/abra2/blob/master/src/main/java/abra/SortedSAMWriter.java#L107

Here is the stack trace from our error:

INFO    Thu Apr 04 15:26:51 EDT 2019    PROCESS_REGION_MSECS:   2_158655846_158655966   3676    63  11  0
INFO    Thu Apr 04 15:26:55 EDT 2019    PROCESS_REGION_MSECS:   1_11187651_11187891 5148    13  22  0
INFO    Thu Apr 04 15:26:55 EDT 2019    PROCESS_REGION_MSECS:   2_61725997_61726117 113 63  14  0
INFO    Thu Apr 04 15:26:55 EDT 2019    PROCESS_REGION_MSECS:   1_11188001_11188241 167 10  18  0
INFO    Thu Apr 04 15:26:55 EDT 2019    PROCESS_REGION_MSECS:   2_95849218_95849383 115 23  13  0
INFO    Thu Apr 04 15:26:55 EDT 2019    PROCESS_REGION_MSECS:   1_156846157_156846397   311 21  20  0
INFO    Thu Apr 04 15:26:56 EDT 2019    Clock time in Chromosome: 2_125000001_150000000: 35
INFO    Thu Apr 04 15:26:56 EDT 2019    Processing chromosome chunk: 2_175000001_200000000
INFO    Thu Apr 04 15:26:56 EDT 2019    Clock time in Chromosome: 1_100000001_125000000: 57
INFO    Thu Apr 04 15:26:56 EDT 2019    Processing chromosome chunk: 2_200000001_225000000
java.lang.ArrayIndexOutOfBoundsException: 1
    at abra.SortedSAMWriter.addAlignment(SortedSAMWriter.java:107)
    at abra.ReAligner.remapReads(ReAligner.java:757)
    at abra.ReAligner.processChromosomeChunk(ReAligner.java:404)
    at abra.ReAlignerRunnable.go(ReAlignerRunnable.java:21)
    at abra.AbraRunnable.run(AbraRunnable.java:20)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
java.lang.ArrayIndexOutOfBoundsException: 1
    at abra.SortedSAMWriter.addAlignment(SortedSAMWriter.java:107)
    at abra.ReAligner.remapReads(ReAligner.java:757)
    at abra.ReAligner.processChromosomeChunk(ReAligner.java:404)
    at abra.ReAlignerRunnable.go(ReAlignerRunnable.java:21)
    at abra.AbraRunnable.run(AbraRunnable.java:20)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
INFO    Thu Apr 04 15:26:56 EDT 2019    PROCESS_REGION_MSECS:   2_178098685_178099045   239 10  21  0

My guess is that there should be a check for N/A before trying to parse this field, similar to how it is handled here:

https://github.com/mozack/abra2/blob/a84739338038c32b313977b381597e3ce8941dbe/src/main/java/abra/SortedSAMWriter.java#L404

Let me know if I can provide more information.

ionox0 commented 5 years ago

I should note that we are attempting to run Abra two times, in order to find out whether an additional realignment step might give better results. This is possibly not an intended use case

mozack commented 5 years ago

Two runs with the current version of ABRA2 should be OK, although I doubt the results will improve much.

If you're trying to run ABRA2 against output from an older ABRA version, you may run into trouble as the format has changed in the YO tag.

The exception above appears to be during the sort phase. You could try disabling the sort using option: --nosort and sort the BAM file after ABRA2 completes.

However, please keep in mind that I have not explicitly tested running ABRA2 against older versions of ABRA.

ionox0 commented 5 years ago

Thank you very much for the info, we'll be sticking with ABRA2 to keep the YO tag format consistent.