tecangenomics / nudup

NuDup -- Marks/removes duplicate molecules based on the molecular tagging technology used in Tecan products.
http://www.tecangenomics.com
GNU Lesser General Public License v3.0

Better description on how to append tag to read name (Case 2) #17

Closed cjfields closed 5 years ago

cjfields commented 5 years ago

In the documentation this is mentioned:

- CASE 2 (Runtime Optimized): User supplies only one input file,
 1) SAM/BAM file that a) ends with .sam or .bam extension b) contains unique
    alignments only c) is sorted d) has a fixed length sequence containing the
    molecular tag appended to each read name.

I am testing this approach for our pipeline to inline the work, but it's not working. It would be very helpful to show (in the documentation) a specific example of what is expected for this format.

Specifically, it's not clear if you mean the read name (the first part of the title) or the entire title line. If I simply append to the read name (first part), like so:

# before
@K00363:141:HWTFVBBXX:7:1101:2970:1894 1:N:0:AGTGAG
GAGTGGGAAAGTAGTATTGTTTTTTGTTTTTTTTGTGTTTTGTGTTATAAAGTCTCAAGTGCGGAAGAGGATGGGGAGGAATTGTGGTATCCAGGGTTGT
+
AAFFFJJFJFJAJJFJAFFJF-FJF--AFJ-7---A---F--A7JA----------7-<<A7777-7-<-7-<--7---7--<<-<-<-A---7-7----

# after
@K00363:141:HWTFVBBXX:7:1101:2970:1894CAGCAA 1:N:0:AGTGAG
GAGTGGGAAAGTAGTATTGTTTTTTGTTTTTTTTGTGTTTTGTGTTATAAAGTCTCAAGTGCGGAAGAGGATGGGGAGGAATTGTGGTATCCAGGGTTGT
+
AAFFFJJFJFJAJJFJAFFJF-FJF--AFJ-7---A---F--A7JA----------7-<<A7777-7-<-7-<--7---7--<<-<-<-A---7-7----

the code errors out with no useful message (note #12):

2018-09-18 10:21:07,193 [     INFO] - Deduplicating NuGEN single end reads...
2018-09-18 10:21:07,206 [     INFO] - Processing sorted SAM/BAM with molecular tag sequence in read name (assumes sorted)
2018-09-18 10:21:07,535 [    ERROR] -

Using the index file separately works fine however.

shuelga commented 5 years ago

Hi @cjfields - Appending should be done to the end of the entire read line, so in your case after the 1:N:0:AGTGAG. You can use any delimiter to append. The first step of CASE 1 does this for you, so if you have the index file, I'd recommend just running it that way.
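For illustration, the appending described above could be sketched roughly as follows. This is not code from NuDup: the function name `append_tags`, the `:` delimiter, and the way the tags are supplied (a plain list, one per read) are all assumptions, and the sequence and quality lines are truncated from the record in the original post.

```python
from typing import Iterable, Iterator


def append_tags(fastq_lines: Iterable[str], tags: Iterable[str],
                delimiter: str = ":") -> Iterator[str]:
    """Yield FASTQ lines with one tag appended to the END of each full title line.

    Hypothetical helper: `tags` must supply one molecular tag per read,
    in the same order as the reads.
    """
    tag_iter = iter(tags)
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # every 4th line, starting at 0, is a title line
            if not line.startswith("@"):
                raise ValueError(f"expected a FASTQ title line, got: {line!r}")
            try:
                tag = next(tag_iter)
            except StopIteration:
                raise ValueError("fewer tags than reads") from None
            # Append after the whole title (name + comment), not after the name.
            line = f"{line}{delimiter}{tag}"
        yield line


# Record from the post above, with sequence/quality truncated for brevity.
record = [
    "@K00363:141:HWTFVBBXX:7:1101:2970:1894 1:N:0:AGTGAG",
    "GAGTGGGAAAGT",
    "+",
    "AAFFFJJFJFJA",
]
tagged = list(append_tags(record, ["CAGCAA"]))
print(tagged[0])
# @K00363:141:HWTFVBBXX:7:1101:2970:1894 1:N:0:AGTGAG:CAGCAA
```

Only the title line changes; the sequence, separator, and quality lines pass through untouched.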

cjfields commented 5 years ago

Hi @shuelga I gave that a try but I'm seeing other problems: #10. I'll test adding the barcode first, but I may have to test out the alternative script mentioned in that ticket.

cjfields commented 5 years ago

@shuelga adding the barcode to the end of the full read line worked (I simply add a colon + UMI sequence using a simple python script). I do find that I have to skip the step where the bismark read information is stripped out using strip_bismark_sam.sh, as this also very effectively removes the added UMI (I also noticed the UMI gets stripped out when I add it to the read name as I mentioned above). The dedup step does still seem to work, though, but it may be worth mentioning this in the documentation somehow.
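A quick sanity check along these lines could flag a SAM file whose read names have lost the tag before it is handed to the CASE 2 path. It is only a sketch under assumptions: the helper name `reads_missing_tag` is made up, the read names in the demo are invented, and the test is a heuristic, since it only asks whether each read name ends in a fixed number of A/C/G/T/N characters and cannot tell a molecular tag from any other trailing bases (such as a sample barcode).

```python
import re
from typing import Iterable, List


def reads_missing_tag(sam_lines: Iterable[str], tag_length: int) -> List[str]:
    """Return read names (QNAME) that do not end in `tag_length` bases.

    Heuristic only: any trailing run of A/C/G/T/N of that length passes.
    """
    tag_re = re.compile(rf"[ACGTN]{{{tag_length}}}$")
    missing = []
    for line in sam_lines:
        # Skip SAM header lines (they start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        qname = line.split("\t", 1)[0]  # QNAME is the first tab-separated field
        if not tag_re.search(qname):
            missing.append(qname)
    return missing


# Invented two-read example: one name keeps its tag, one has been stripped.
sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1:CAGCAA\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tFFFF",
    "read2\t0\tchr1\t200\t42\t4M\t*\t0\t0\tACGT\tFFFF",
]
print(reads_missing_tag(sam, tag_length=6))
# ['read2']
```

An empty list means every read name at least has the right shape; it does not prove the trailing bases are the intended tag.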

shuelga commented 5 years ago

That's correct, @cjfields: the strip_bismark_sam.sh step should only be used for CASE 1. Thanks!

cjfields commented 5 years ago

Hi @shuelga I may send a few documentation fixes as a pull request. But it appears the pipeline run finished successfully. Thanks!