homeveg / nuctools

software for analysis of chromatin feature occupancy profiles from high-throughput sequencing data
GNU General Public License v3.0
15 stars 3 forks source link

extend_PE_reads.pl issues #13

Open axshweta opened 3 years ago

axshweta commented 3 years ago

Hi there,

I am trying to run these suite of tools on Paired-end 150 bp MNase ChIP-seq data and I'm having trouble with getting good data when I use the extend_PE_reads.pl script. The script seems to run okay (with the perl line instantiated Warnings) but the resulting occ files and the calculated avg_profiles don't look like they make much sense...

Interestingly, when I skip the extend_PE_reads.pl step and run all the downstream tools starting with the chromosome splitting step the data looks pretty okay ( the aggregate profile looks reasonable) but I'm worried about how to interpret results when this step is skipped since I'm not imposing any insert length constraints or formatting everything into a single line instead of separate mate pairs.

Perhaps the issue has to do with the way I'm processing my data: bwa --> sam --> bam --> sorted bam (by read name) --> bamtobed

here's the first few lines of my input bed: chr5 85343921 85344071 A00564:291:HNMKTDSXY:1:1101:1081:31798/1 60 + chr5 85343965 85344115 A00564:291:HNMKTDSXY:1:1101:1081:31798/2 60 - chr8 44643710 44643859 A00564:291:HNMKTDSXY:1:1101:1090:4445/1 0 + chr8 44643747 44643897 A00564:291:HNMKTDSXY:1:1101:1090:4445/2 0 - chr4 75704836 75704986 A00564:291:HNMKTDSXY:1:1101:1090:16376/1 60 + chr4 75704847 75704997 A00564:291:HNMKTDSXY:1:1101:1090:16376/2 60 - chr2 43733654 43733781 A00564:291:HNMKTDSXY:1:1101:1090:17378/1 60 - chr2 43733654 43733782 A00564:291:HNMKTDSXY:1:1101:1090:17378/2 60 + chr2 226656017 226656162 A00564:291:HNMKTDSXY:1:1101:1090:18505/1 60 + chr2 226656017 226656162 A00564:291:HNMKTDSXY:1:1101:1090:18505/2 60 -

first few lines of my output bed generated by the script: chr8 23266835 23267419 584 chr19 3875472 3875997 525 chr12 31950179 31950372 193 chr12 31950180 31950372 192 chr1 68616820 68617006 186 chr1 68616820 68617006 186 chr1 53738030 53739009 979 chr7 64952935 64953736 801 chr7 64952896 64953697 801 chr1 43652533 43653018 485

Please do let me know if I can provide anything else to help explain my issue better.

Thanks a lot!

Anu

axshweta commented 3 years ago

Okay, I took a look at the PE-extended bed output on IGV and it seems like many regions that are nucleosome-sized are being discarded and some that are not are being recognized. I wonder if I'm missing a sorting step?

axshweta commented 3 years ago

I think I see where the problem is in extend_PE_reads.pl. $read1 and $read2 do not have the same name. They differ by /1 and /2. Here's my low tech workaround where I remove the last character from the name to get it to work for me:

`my @newline1=split(/\t/, $line1); my @newline2=split(/\t/, $line2);

            my $chr_name_1=$newline1[0];
            my $chr_name_2=$newline2[0];

            my $min = min ($newline1[1],$newline2[1]);
            $min = $min <= 0 ? 0 : $min;
            my $max = max ($newline1[2],$newline2[2]);
            my $nuc_length = $max - $min;

            my $read_1=$newline1[3];
            my $read_2=$newline2[3];
            chop($read_1);
            chop($read_2);

`