nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Other
386 stars 182 forks source link

Problem with merging reads #308

Closed sanchari24 closed 4 years ago

sanchari24 commented 4 years ago

Hello, I am running Hic-Pro (version 2.11.1) on a cluster and it crashed at the following step. Kindly help.

`## mergeBAM.py

forward= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R1_paired_TAIR10.fasta.bwt2merged.bam

reverse= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2merged.bam

output= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_TAIR10.fasta.bwt2pairs.bam

min mapq= 20

report_single= False

report_multi= False

verbose= True

Merging forward and reverse tags ...

1000000

2000000

Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted. ` I checked the trimmed fastq files, they have the same number of reads

head trim_CHiC_clf_Root_S8_R1_paired.fastq

@NB501040:225:HV3KFBGXC:1:11101:10000:6066 1:N:0:CATGGC GGCTTCATGGGCCATTATTGGCCTTTTCTTCTTTTTTTTTTTTTTTTAATTTCCTTGTTTATATAAATTTTTGATA + AAAAAEEEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEAEE///A/E/E/E////A///AA/E/A/ @NB501040:225:HV3KFBGXC:1:11101:10000:7825 1:N:0:CATGGC TCTTTGACGGTTTATATATAATACACATAAAATGTAAATCTCCCTAACAAAAATGAGTAATGATACATATATACAC + AAAAAEAEEE/EA/EAE6EEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEAEEEAEE6AAEAEEEEEE/AEEAAA @NB501040:225:HV3KFBGXC:1:11101:10001:10268 1:N:0:CATGGC GGAAATAATACTTGTACTACAAAAATCAGAAATCTTAGCTGTTTTTGTACTTGATTGTTGGACTCACAGGTGGCGT

head trim_CHiC_clf_Root_S8_R1_paired.fastq

@NB501040:225:HV3KFBGXC:1:11101:10000:6066 1:N:0:CATGGC GGCTTCATGGGCCATTATTGGCCTTTTCTTCTTTTTTTTTTTTTTTTAATTTCCTTGTTTATATAAATTTTTGATA + AAAAAEEEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEAEE///A/E/E/E////A///AA/E/A/ @NB501040:225:HV3KFBGXC:1:11101:10000:7825 1:N:0:CATGGC TCTTTGACGGTTTATATATAATACACATAAAATGTAAATCTCCCTAACAAAAATGAGTAATGATACATATATACAC + AAAAAEAEEE/EA/EAE6EEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEAEEEAEE6AAEAEEEEEE/AEEAAA @NB501040:225:HV3KFBGXC:1:11101:10001:10268 1:N:0:CATGGC GGAAATAATACTTGTACTACAAAAATCAGAAATCTTAGCTGTTTTTGTACTTGATTGTTGGACTCACAGGTGGCGT My config file has the following PAIR1_EXT = _R1_paired PAIR2_EXT = _R2_paired

And I have one folder per sample.

nservant commented 4 years ago

Could you check in the logs if you have more errors please ?

sanchari24 commented 4 years ago

Hi, so I have these files in the folder /hic_results_28Jan/logs. Only the mergeSAM.log has the error. Rest other files do not report any error.

mergeSAM.log mapping_combine.log mapping_stats.log mapping_step1.log mapping_step2.log trim_CHiC_clf_Root_S8_R1_paired_TAIR10.fasta.bwt2glob.unmap_readsTrimming.log trim_CHiC_clf_Root_S8_R1_paired_bowtie2.log trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2glob.unmap_bowtie2.log trim_CHiC_clf_Root_S8_R2_paired_bowtie2.log trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2glob.unmap_bowtie2.log trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2glob.unmap_readsTrimming.log

The stderr file in my slurm folder has this

Wed 29 Jan 2020 05:18:24 PM CET Pairing of R1 and R2 tags ... Logs: logs/CHiC_clf_Root_S8/mergeSAM.log make: *** [/opt/share/Debian-Buster/HiC-Pro/HiC-Pro_2.11.1/bin/../scripts//Makefile:144: bowtie_pairing] Error 1

Any particular file I need to look into ?

nservant commented 4 years ago

mergeSAM.log please

sanchari24 commented 4 years ago

/usr/bin/python /opt/share/Debian-Buster/HiC-Pro/HiC-Pro_2.11.1/scripts/mergeSAM.py -q 20 -t -v -f Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R1_paired_TAIR10.fasta.bwt2merged.bam -r Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2merged.bam -o Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_TAIR10.fasta.bwt2pairs.bam

mergeBAM.py

forward= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R1_paired_TAIR10.fasta.bwt2merged.bam

reverse= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2merged.bam

output= Bowtie2/bwt2/CHiC_clf_Root_S8/trim_CHiC_clf_Root_S8_TAIR10.fasta.bwt2pairs.bam

min mapq= 20

report_single= False

report_multi= False

verbose= True

Merging forward and reverse tags ...

1000000

2000000

Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted.

nservant commented 4 years ago

Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted. Could you check your read name please ? in theory R1 reads and R2 reads should have the same names ...

sanchari24 commented 4 years ago

Hello, I checked the first 3000000 lines in the "...fasta.bwt2merged.bam" files in the bwt2 folder and exported them as text files using samtools view trim_CHiC_clf_Root_S8_R2_paired_TAIR10.fasta.bwt2merged.bam|head -3000000

I made a diff command on R1 vs R2 and R2 vs R1 I get the following differences. I see that sometimes the difference is due to ">" signs and the numbers. Any idea how to resolve this ?

$diff R1.txt R2.txt

2255646d2255645 < NB501040:225:HV3KFBGXC:1:11311:14928:7992 0 2 13012134 42 76M 0 0 GATTAGAATCTTACAATTTATGTGGACCAAGCTTTTACTTTTTCTCTCGAAGAATATTATTTTGTTCAAGAATCAT AAAAAEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG 2875065d2875063 < NB501040:225:HV3KFBGXC:1:12109:8874:16587 4 0 0 0 0 GGAACCAGCAATCATTGTTCAAAAATCGCAGATAATGGAAGTAATGGAGATATAGCATGTGATGGGTATCACAAAT AA/AA/AEAEE/EEAE6EEEEEEEE/EEEAAEEEEEEEE/AEEEEE<EEEE6EEEEEEE<AEAEEEEEEEE//EE/ YT:Z:UU RG:Z:BML 3000000a2999999,3000000 > NB501040:225:HV3KFBGXC:1:12111:6699:7303 0 3 1632615 42 76M * 0 0 ATCGCAGCGGTTCGACGCGGAATGCAGCTCTTCAGACAGGTAAATTGGAAAATTCCGACAAACAGAACTTTTCTAA AAAAAEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG > NB501040:225:HV3KFBGXC:1:12111:6699:7722 16 5 25864489 42 76M * 0 0 GAGTCTACATTGTCTGATTTAGGAAACAAATGAGTTACAAGAGAGACTACAGAGAAATTACCAATCAGGAGTAGCT /EAEEAEEEEEE/EEEEEEEEAEEAE/<EEEE</EEEEEAAEEAEEEEEE/EEE/EEEAEEE/6EEEAEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG

$diff R2.txt R1.txt

2255645a2255646 > NB501040:225:HV3KFBGXC:1:11311:14928:7992 0 2 13012134 42 76M * 0 0 GATTAGAATCTTACAATTTATGTGGACCAAGCTTTTACTTTTTCTCTCGAAGAATATTATTTTGTTCAAGAATCAT AAAAAEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG 2875063a2875065 > NB501040:225:HV3KFBGXC:1:12109:8874:16587 4 0 0 0 0 GGAACCAGCAATCATTGTTCAAAAATCGCAGATAATGGAAGTAATGGAGATATAGCATGTGATGGGTATCACAAAT AA/AA/AEAEE/EEAE6EEEEEEEE/EEEAAEEEEEEEE/AEEEEE<EEEE6EEEEEEE<AEAEEEEEEEE//EE/ YT:Z:UU RG:Z:BML 2999999,3000000d3000000 < NB501040:225:HV3KFBGXC:1:12111:6699:7303 0 3 1632615 42 76M 0 0 ATCGCAGCGGTTCGACGCGGAATGCAGCTCTTCAGACAGGTAAATTGGAAAATTCCGACAAACAGAACTTTTCTAA AAAAAEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG < NB501040:225:HV3KFBGXC:1:12111:6699:7722 16 5 25864489 42 76M * 0 0 GAGTCTACATTGTCTGATTTAGGAAACAAATGAGTTACAAGAGAGACTACAGAGAAATTACCAATCAGGAGTAGCT /EAEEAEEEEEE/EEEEEEEEAEEAE/<EEEE</EEEEEAAEEAEEEEEE/EEE/EEEAEEE/6EEEAEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU RG:Z:BMG

nservant commented 4 years ago

This is the first time I see that. So far, the only idea would be create new files that fix this issue. For instance, using a simple awk command line, you should be able to remove the "</>"

sanchari24 commented 4 years ago

Hi, so these ">" are not there in the fastq files, but appearing after we get the bam files. If I edit the "....._R1_paired_TAIR10.fasta.bwt2merged.bam" files, how do I launch the HiC-Pro steps with the edited files as the starting point?

nservant commented 4 years ago

That's strange. To my knowledge, bowtie2 is not expected to change the read names ... To run HiC-Pro from the R1/R2.bam files, use the same data organization, and use the stepwise mode of HiC-Pro (-s options)

sanchari24 commented 4 years ago

Hello, I was finally able to make the R1 and R2 files comparable and was able to merge them as shown below. However, the final Valid pair file size is 0 kb. I have attached some of the statistics files in case if you can kindly figure out anything. By the way, my data is Capture HiC. I have pasted the HiC-Pro config file for your reference. Thank you for your time.

HiC-Pro Mapping Statistics CHiC_clf_Shoot_S3/trim_CHiC_clf_Shoot_S3_R1_paired_TAIR10.fasta.mapstat total_R1 64764546 mapped_R1 59933519 global_R1 47827791 local_R1 12105728

HiC-Pro Mapping Statistics CHiC_clf_Shoot_S3/trim_CHiC_clf_Shoot_S3_R2_paired_TAIR10.fasta.mapstat total_R2 64764546 mapped_R2 59933766 global_R2 47828593 local_R2 12105173

CHiC_clf_Shoot_S3.mpairstat merge_statfiles.py dir= Bowtie2/bwt2/CHiC_clf_Shoot_S3/ pattern= *.pairstat Merging 1 files Use /ips2/IPS2 (7864)/DDEVE (8095)/CHROCYC (8096)/sanchari/David_HiChIP_polycomb_project/Capture_HiC_polycomb_mutant_shoot_root/CHiC_clf_Shoot_S3_hic_result/Bowtie2/bwt2/CHiC_clf_Shoot_S3/trim_CHiC_clf_Shoot_S3_TAIR10.fasta.bwt2pairs.pairstat as template Total_pairs_processed 64764546 100.0 Unmapped_pairs 4830779 7.459 Low_qual_pairs 10173345 15.708 Unique_paired_alignments 49760173 76.832 Multiple_pairs_alignments 0 0.0 Pairs_with_singleton 249 0.0 Low_qual_singleton 0 0.0 Unique_singleton_alignments 0 0.0 Multiple_singleton_alignments 0 0.0 Reported_pairs 49760173 76.832

CHiC_clf_Shoot_S3 File

merge_statfiles.py dir= CHiC_clf_Shoot_S3_hic_result/data//CHiC_clf_Shoot_S3/ pattern= *.RSstat Merging 1 files Use /ips2/IPS2 (7864)/DDEVE (8095)/CHROCYC (8096)/sanchari/David_HiChIP_polycomb_project/Capture_HiC_polycomb_mutant_shoot_root/CHiC_clf_Shoot_S3_hic_result/CHiC_clf_Shoot_S3_hic_result/data/CHiC_clf_Shoot_S3/trim_CHiC_clf_Shoot_S3_TAIR10.fasta.bwt2pairs.RSstat as template Valid_interaction_pairs 0 Valid_interaction_pairs_FF 0 Valid_interaction_pairs_RR 0 Valid_interaction_pairs_RF 0 Valid_interaction_pairs_FR 0 Dangling_end_pairs 0 Religation_pairs 0 Self_Cycle_pairs 0 Single-end_pairs 0 Filtered_pairs 499519 Dumped_pairs 49260654

Config File

Please change the variable settings below if necessary

#########################################################################

Paths and Settings - Do not edit !

#########################################################################

TMP_DIR = tmp LOGS_DIR = logs BOWTIE2_OUTPUT_DIR = Bowtie2 MAPC_OUTPUT = CHiC_clf_Shoot_S3_hic_result RAW_DIR = Trimming_v1

#######################################################################

SYSTEM - PBS - Start Editing Here !!

####################################################################### N_CPU = 16 LOGFILE = hicpro.log

JOB_NAME = Capture_HiC_polycomb_mutant_shoot_root_2Feb JOB_MEM = 60gb JOB_WALLTIME = 6:00:00 JOB_QUEUE = batch JOB_MAIL = sanchari.sircar@u-psud.fr

#########################################################################

Data

#########################################################################

BECAREFUL IT DOESN'T EXCEPT REGULAR EXPRESSION! YOU CAN'T PUT STAR

PAIR1_EXT = _R1_paired PAIR2_EXT = _R2_paired

#######################################################################

Alignment options

#######################################################################

FORMAT = phred33 MIN_MAPQ = 20

BOWTIE2_IDX_PATH = /K/CHROCYC/sanchari/Arabido_Bowtie2_indexes/ BOWTIE2_GLOBAL_OPTIONS = --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder BOWTIE2_LOCAL_OPTIONS = --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder

#######################################################################

Annotation files

#######################################################################

REFERENCE_GENOME = TAIR10.fasta GENOME_SIZE = TAIR10.sizes.genome

#######################################################################

Allele specific

#######################################################################

ALLELE_SPECIFIC_SNP =

#######################################################################

Digestion Hi-C

#######################################################################

GENOME_FRAGMENT = DpnII_resfrag_TAIR10.bed LIGATION_SITE = GATC MIN_FRAG_SIZE = 50 MAX_FRAG_SIZE = 100000 MIN_INSERT_SIZE = 100 MAX_INSERT_SIZE =

#######################################################################

Hi-C processing

#######################################################################

MIN_CIS_DIST = GET_ALL_INTERACTION_CLASSES = 1 GET_PROCESS_SAM = 1 RM_SINGLETON = 1 RM_MULTI = 1 RM_DUP = 1

#######################################################################

Contact Maps

#######################################################################

BIN_SIZE = 1000 2000 5000 10000 MATRIX_FORMAT = upper

#######################################################################

ICE Normalization

####################################################################### MAX_ITER = 100 FILTER_LOW_COUNT_PERC = 0.02 FILTER_HIGH_COUNT_PERC = 0 EPS = 0.1

Ferossitiziano commented 4 years ago

Dear Nicolas,

I am having the same issue described in this post. The error message I get in logs/sample/mergeSAM.log is:

"Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted"

My _R1 and _R2 files have the very same read names, in the same order, however, the R1_mm10.bwt2merged.bam and R2_mm10.bwt2merged.bam have the read name in different orders, because .bam files are not sorted by name.

Would it be possible for you to fix the pipeline so that the merged bam files are sorted in the proper way?

nservant commented 4 years ago

Hi HiC-Pro is expected to sort the BAM in a proper way before merging them. One assumption could be that the sort crashed for any reason ... do you have message in the log files ? N

Ferossitiziano commented 4 years ago

Yes, here is my mergeSAM.log file:

/hpcnfs/home/ieo5073/miniconda3/envs/HiCpro/bin/python /hpcnfs/home/ieo5073/miniconda3/envs/HiCpro/bin/HiC-Pro_2.11.3/scripts/mergeSAM.py -q 10 -t -v -f bowtie_results/bw t2/WTRing1B/WTRing1bFPm_00_R1_mm10.bwt2merged.bam -r bowtie_results/bwt2/WTRing1B/WTRing1bFPm_00_R2_mm10.bwt2merged.bam -o bowtie_results/bwt2/WTRing1B/WTRing1bFPm_00_mm1 0.bwt2pairs.bam

mergeBAM.py

forward= bowtie_results/bwt2/WTRing1B/WTRing1bFPm_00_R1_mm10.bwt2merged.bam

reverse= bowtie_results/bwt2/WTRing1B/WTRing1bFPm_00_R2_mm10.bwt2merged.bam

output= bowtie_results/bwt2/WTRing1B/WTRing1bFPm_00_mm10.bwt2pairs.bam

min mapq= 10

report_single= False

report_multi= False

verbose= True

Merging forward and reverse tags ...

Forward and reverse reads not paired. Check that BAM files have the same read names and are sorted.

Thank you very much, Federico

Ferossitiziano commented 4 years ago

Sorry, I've just found an error message in the mapping_combine.log:

samtools sort: couldn't allocate memory for bam_mem

I guess this explain the issue with the sorting. I will try to fix this issue before proceeding any further.

Thanks, Federico

nservant commented 4 years ago

Yes exactly, BAM are expected to be sorted by name at the step before. The point is why HIC-Pro did not exit with an error status ... strange .. Anyway, you have to fixe the memory issue :)

StephF-1130 commented 4 years ago

Hi, I had the same problem. Once you fix the problem, how do you rerun hicpro from that point forward instead of starting the whole thing over?

nservant commented 4 years ago

Hi Stephanie, Which problem ? the sort issue ? You can run HiC-Pro in stepwise mode with the -s option (see the manual for details). In this case, the input (-i) has to be updated for each step. As you did before to create the maps, you provided the .validPairs as an input ... Best