nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

Fastq end mis-identification #419

Open ArielPaulson opened 3 years ago

ArielPaulson commented 3 years ago

Hi again,

I just noticed that the fastq _1 vs. _2 detection is too general. If a fastq name contains the string "_1" anywhere, as in "SPT399_10M_1.fastq.gz" and "SPT399_10M_2.fastq.gz", both files are detected as end-1 fastqs because of the "_10M". The pipeline then stops because there are twice as many end-1s as end-2s. The grep should probably be anchored to the end of the file name, e.g. "_1\.(fq|fastq)\.gz$" or something similar.
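To illustrate the difference, here is a minimal Python sketch comparing a plain substring test with an end-anchored pattern. It is not HiC-Pro's actual code; the function names and the exact regex are assumptions made for the example, and only the two file names come from the report above.

```python
import re

# File names from the report above.
fastqs = ["SPT399_10M_1.fastq.gz", "SPT399_10M_2.fastq.gz"]

def naive_is_end1(name, tag="_1"):
    # Substring test: matches "_1" anywhere, including inside "_10M".
    return tag in name

# Hypothetical anchored pattern: the end tag must directly precede
# the fq/fastq extension (optionally gzipped) at the end of the name.
END_RE = re.compile(r"_([12])\.(?:fq|fastq)(?:\.gz)?$")

def anchored_end(name):
    # Returns 1 or 2 for a recognized end tag, None otherwise.
    m = END_RE.search(name)
    return int(m.group(1)) if m else None

naive_end1 = [f for f in fastqs if naive_is_end1(f)]
anchored = {f: anchored_end(f) for f in fastqs}

print(naive_end1)  # both files pass the substring test
print(anchored)    # each file is assigned to its own end
```

With the anchored pattern, the "_10M" in the middle of the name no longer matters, because only the tag immediately before the extension is considered.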

Thanks, Ariel

ArielPaulson commented 3 years ago

Also, I got a new error: "Exit: Conflict in file names. PAIR1_EXT/PAIR2_EXT detected in REFERENCE_GENOME. Please correct before running. Exit".

This is most likely because the genome name contains "_2.0", but the error makes no sense to me: there should never be any conflict between input fastqs and genome files. They are clearly distinguished in the config file and are not even remotely in the same location.

Does this pipeline really try to distinguish fastqs from genomic references based on greps??? Why??? This should never become an issue. Having strings like "_1" or "_2" is quite common in genome names anyway.
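As a rough sketch of what I mean (Python, not the pipeline's real logic): a substring test against the genome name fires on "_2.0", whereas a check that only looks at the end of each fastq basename never needs to inspect the genome name at all. The genome name "asm_2.0", the function names, and the extension values "_1"/"_2" are all hypothetical.

```python
import os

def substring_conflict(genome, pair1_ext, pair2_ext):
    # Assumed form of the check: flags the genome name if either
    # pair extension occurs anywhere inside it.
    return pair1_ext in genome or pair2_ext in genome

def classify_fastq(path, pair1_ext, pair2_ext):
    # Looks only at the fastq basename: strip a known fastq suffix,
    # then test whether the remaining stem ends with a pair extension.
    base = os.path.basename(path)
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if base.endswith(suffix):
            stem = base[: -len(suffix)]
            break
    else:
        return None  # not a fastq file, so never classified
    if stem.endswith(pair1_ext):
        return 1
    if stem.endswith(pair2_ext):
        return 2
    return None

genome = "asm_2.0"  # hypothetical genome name containing "_2.0"
conflict = substring_conflict(genome, "_1", "_2")
ends = [classify_fastq(p, "_1", "_2")
        for p in ("SPT399_10M_1.fastq.gz", "SPT399_10M_2.fastq.gz")]

print(conflict)  # the substring test flags the genome name
print(ends)      # the suffix-based test classifies the fastqs correctly
```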

Thanks, Ariel