twolinin / longphase

GNU General Public License v3.0
99 stars 9 forks source link

Output HP-specific read names [Feature request] #12

Closed davidebolo1993 closed 2 years ago

davidebolo1993 commented 2 years ago

Hi guys,

thanks for the great work. I was just wondering if there is any chance you can add a flag to output haplotype-specific read names to a TSV (or whatever format you prefer) when running haplotag. This will ideally allows users to save time re-parsing the haplotagged BAM if one wants to extract haplotype-specific read names to "phase" outputs from other tools (for instance, methylation calls from Nanopolish). Thanks,

Davide

ythuang0522 commented 2 years ago

Hi @davidebolo1993, do you mean something like samtools view haptag.bam | grep ‘HP:i:1’ | cut -f 1 | grep -v '^@' > HP1.readnames samtools view haptag.bam | grep ‘HP:i:2’ | cut -f 1 | grep -v '^@' > HP2.readnames

davidebolo1993 commented 2 years ago

Something like samtools view haptag.bam | grep ‘HP:i:1’ | cut -f 1 works just fine. My thoughts are that we can avoid to read the tagged BAM again if those names are stored when reads are tagged (maybe by setting an additional flag). There is no need to hurry on this, I can do this easily with pysam or standard shell. Just wondering if this can be useful to the users.

Thanks

ythuang0522 commented 2 years ago

I never played with Nanopolish for methylation phasing. May I know if it can completely ignore the haplotagged bam and only needs two lists of read names? If so this would make sense to me. The haplotagging stage is very time-consuming due to the need to adding the tags back to the bam. The output of read names without bam might be faster.

davidebolo1993 commented 2 years ago

This is not something related to Nanopolish itself. What I meant is that I'm using haplotype-specific read names to parse the output table from Nanopolish and split this into haplotype-specific methylation frequencies. I'm currently doing this by re-parsing the HP-tagged BAM, looking at HP tag (if any) for each read, and use tagged reads to split the original table into 2 groups. I was asking if there is any chance to have a flag to output a file with something like {read_name : haplotype} - this is a json, but any format is fine with me.

ythuang0522 commented 2 years ago

Got it. Will spare time to implement this alternative output.

ythuang0522 commented 2 years ago

The v1.1 haplotag implemented --log which outputs a tabular plain-text file for your needs. It stores the assigned haplotype ID (i.e., 1 or 2) for each read and related statistics.

davidebolo1993 commented 2 years ago

Thanks !