liumz93 / PEM-Q

A pipeline for processing PEM-seq or similar data; more comprehensive than superQ

new issues with PEM-Q.py #7

Open aprilW0829 opened 1 week ago

aprilW0829 commented 1 week ago

I encountered the following error when testing the example data. Is there a solution to this error? Thanks.

[PEM-Q] primerChrom: chr15
[PEM-Q] primer_start: 61986633
[PEM-Q] primer_end: 61986652
[PEM-Q] primer_strand: +
[PEM-Q] primer: GGAAACCAGAGGGAATCCTC
[PEM-Q] adapter: CCACGCGTGCTCTACA
[PEM-Q] genome: /data2/wangxin/database/bwa/bwa_mm10/mm10
[PEM-Q] fastq_r1: CC055c_R1.fq.gz
[PEM-Q] fastq_r2: CC055c_R2.fq.gz
[PEM-Q] your adapter sequence: CCACGCGTGCTCTACA
[PEM-Q] align to adapter...
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (53.72%) overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider increasing this parameter. (As-is, FLASH is penalizing overlaps longer than 65 bp when considering them for possible combining!)
[PEM-Q] stitching reads using FLASh...
[PEM-Q] bwa mem -t 8 adapter/adapter -k 10 -L 0 -T 10 CC055c_R2.fq.gz > bwa_align/CC055c_sti.adpt.sam 2>bwa_align/bwa_align_adapter.log
[PEM-Q] sort and index bam...
[PEM-Q] merging fastq files...
mkdir: cannot create directory ‘bwa_align’: File exists
[PEM-Q] index file used None//data2/wangxin/database/bwa/bwa_mm10/mm10//data2/wangxin/database/bwa/bwa_mm10/mm10
[PEM-Q] align to genome...
[PEM-Q] bwa mem -Y -t 8 None//data2/wangxin/database/bwa/bwa_mm10/mm10//data2/wangxin/database/bwa/bwa_mm10/mm10 flash_out/CC055c.merge.fastq.gz > bwa_align/CC055c_sti.sam 2>bwa_align/bwa_alignstich.log
[PEM-Q] filter no_primer reads...
[PEM-Q] Your primer sequence: GGAAACCAGAGGGAATCCTC
[PEM-Q] primer position: chr15 1 61986633 61986652
Traceback (most recent call last):
  File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 536, in <module>
    main()
  File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 529, in main
    alignment.no_primer_filter()
  File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 330, in no_primer_filter
    bam_file = pysam.AlignmentFile('bwa_align/'+self.bam_sort, 'rb')
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

liumz93 commented 1 week ago

Hello, thank you for your feedback. Based on the error message, it seems that the alignment file was not generated correctly. The alignment tool I am using is BWA, and the version I recently tested is 0.7.18-r1243-dirty. However, in general, the version of BWA should not affect the program's execution. My suggestion is that you first check the log files under data/bwa_align to see if there are more detailed error messages for further debugging.
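For anyone following the same debugging route, here is a minimal sketch of how the log files could be scanned automatically. It assumes the logs are plain text files ending in .log inside one directory, and that errors from BWA and samtools start with the "[E::" prefix; the function name find_log_errors is hypothetical and not part of PEM-Q.

```python
import glob
import os


def find_log_errors(log_dir):
    """Collect lines that look like errors from every *.log file in log_dir.

    Returns a list of (file name, line number, line text) tuples.
    """
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                text = line.strip()
                # BWA and samtools (htslib) prefix error lines with "[E::function]".
                # The generic "error" match is an extra heuristic, not a guarantee.
                if text.startswith("[E::") or "error" in text.lower():
                    hits.append((os.path.basename(path), number, text))
    return hits
```

Calling find_log_errors("bwa_align") from the working directory would list candidate error lines; an empty list only means nothing matched these two patterns, not that the alignment succeeded.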

aprilW0829 commented 1 week ago

Thank you for the quick reply. Following your suggestion, I can now run PEM-Q.py and complete the basic analysis. However, I have encountered another problem when testing the second step, vector_analyze.py:

My code:

vector_analyze.py CC055c pX330_SpCas9.fa bwa_mm10 chr1 + 7937 7956

Error as follows:

[PEM-Q Vector Analysis] basename: CC055c
[PEM-Q Vector Analysis] vector_fa: pX330_SpCas9.fa
[PEM-Q Vector Analysis] genome: /data2/wangxin/database/bwa/bwa_mm10/bwa_mm10
[PEM-Q Vector Analysis] bait_chr: chr1
[PEM-Q Vector Analysis] bait_strand: +
[PEM-Q Vector Analysis] sgRNA_start: 7937
[PEM-Q Vector Analysis] sgRNA_end: 7956
[PEM-Q Vector Analysis]seqtk subseq CC055c_R1.fq.gz indel/CC055c_discard.tab > CC055c_discard_R1.fq
sh: seqtk: command not found
[PEM-Q Vector Analysis]seqtk subseq CC055c_R2.fq.gz indel/CC055c_discard.tab > CC055c_discard_R2.fq
sh: seqtk: command not found
mkdir: cannot create directory ‘pX330_SpCas9’: File exists
mkdir: cannot create directory ‘vector’: File exists
[PEM-Q Vector Analysis] check file...
[PEM-Q Vector Analysis] building vector index...
[PEM-Q Vector Analysis] align pe_fq to vector...
bwa mem -t 8 pX330_SpCas9/pX330_SpCas9 CC055c_discard_R1.fq CC055c_discard_R2.fq > pX330_SpCas9/CC055c_pe_vector.sam 2>pX330_SpCas9/bwa_align_pe_vector.log
[PEM-Q Vector Analysis] sort and index bam...
samtools view -S -b -h pX330_SpCas9/CC055c_pe_vector.sam > pX330_SpCas9/CC055c_pe_vector.bam && samtools sort pX330_SpCas9/CC055c_pe_vector.bam > pX330_SpCas9/CC055c_pe_vector.sort.bam && samtools index pX330_SpCas9/CC055c_pe_vector.sort.bam
[PEM-Q Vector Analysis] align r2 to genome...
bwa mem -t 8 -k 10 /home/mengzhu/database/bwa_indexes//data2/wangxin/database/bwa/bwa_mm10/bwa_mm10//data2/wangxin/database/bwa/bwa_mm10/bwa_mm10 CC055c_discard_R2.fq > pX330_SpCas9/CC055c_r2_genome.sam 2>pX330_SpCas9/bwa_align_pe_vector.log
[PEM-Q Vector Analysis] sort and index bam...
samtools view -S -b -h pX330_SpCas9/CC055c_r2_genome.sam > pX330_SpCas9/CC055c_r2_genome.bam && samtools sort pX330_SpCas9/CC055c_r2_genome.bam > pX330_SpCas9/CC055c_r2_genome.sort.bam && samtools index pX330_SpCas9/CC055c_r2_genome.sort.bam
[PEM-Q Vector Analysis] processing primer filter...
primer filter left: 0
[PEM-Q Vector Analysis] generating proper pair tab...
paired: 0 r1: 0 r2: 0
[E::idx_find_and_load] Could not retrieve index file for 'pX330_SpCas9/CC055c_r1.paired.sort.bam'
[E::idx_find_and_load] Could not retrieve index file for 'pX330_SpCas9/CC055c_r2.paired.sort.bam'
Traceback (most recent call last):
  File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 433, in <module>
    main()
  File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 428, in main
    proper_pair_tab(**kwargs)
  File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 221, in proper_pair_tab
    r2_genome_bam = pysam.AlignmentFile(directory_store+"/"+basename+"_r2_genome.sort.bam",'rb')
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

Following your earlier advice, I checked pX330_SpCas9/bwa_align_pe_vector.log, and it still shows [E::bwa_idx_load_from_disk] fail to locate the index files. However, the previous step completed successfully, which suggests that the mm10 genome index itself is fine, so I don't know what is causing this.
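One way to narrow this down is to test the exact index prefix that appears in the bwa mem command line. The sketch below assumes the index was built with the classic bwa index command, which writes files with the suffixes .amb, .ann, .bwt, .pac and .sa next to the prefix; the helper name missing_bwa_index_files is hypothetical.

```python
import os

# Suffixes written by `bwa index` for the classic BWA aligner.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(prefix):
    """Return the index files expected for `prefix` that are not on disk."""
    return [
        prefix + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not os.path.isfile(prefix + suffix)
    ]
```

Passing the prefix exactly as printed in the log (for example the string that follows "bwa mem -t 8 -k 10") shows whether BWA could have found the index there. A doubled or prefixed path would be expected to report all five files as missing even when the index itself is intact.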

I look forward to your reply and would appreciate any help. Best wishes! wangxin970829

liumz93 commented 1 week ago

Glad to hear that you successfully ran PEM-Q. Regarding the vector analysis, you first need to install a dependency package called seqtk (https://github.com/lh3/seqtk). Since vector insertion events are relatively rare, and the test FASTQ file I provided only includes a small subset of reads, I recommend downloading the complete sequencing file from this link: https://www.biosino.org/node/sample/detail/OES00075922 for a more comprehensive vector analysis.
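A quick pre-flight check can catch a missing dependency before the pipeline starts. This is only a sketch: the tool list below contains the three commands that appear in the logs above (seqtk, bwa, samtools) and may not be the complete set PEM-Q needs, and missing_tools is a hypothetical helper.

```python
import shutil

# Commands seen in the logs in this thread; the full dependency list may be longer.
REQUIRED_TOOLS = ("seqtk", "bwa", "samtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the commands from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

If missing_tools() returns a non-empty list, installing those tools (or adding them to PATH) first avoids the "command not found" messages and the empty FASTQ files they leave behind.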

Additionally, I rechecked the code and noticed a path error in line 127: the hard-coded path "/home/mengzhu/database/bwa_indexes/" is left over from my old setup. I have fixed both tools/align_inser_va.py and vector_analyze.py. Please download the data and the latest code and re-run the analysis.
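To illustrate the kind of problem involved, here is a minimal sketch of resolving a genome argument without a hard-coded prefix. It is not the actual fix in the repository. The function name resolve_index_prefix, the index_root parameter, and the <root>/<genome>/<genome> layout are all assumptions inferred from the paths printed in the logs.

```python
import os


def resolve_index_prefix(genome, index_root=None):
    """Return a BWA index prefix for `genome` (hypothetical helper).

    An absolute path is used as given. A bare name such as "mm10" is
    looked up under index_root, assuming a <root>/<name>/<name> layout.
    """
    if os.path.isabs(genome):
        # Never prepend a root to an absolute path; doing so produces
        # strings like "/old/root//data2/.../mm10".
        return os.path.normpath(genome)
    if index_root is None:
        raise ValueError("index_root is required when genome is not an absolute path")
    return os.path.join(index_root, genome, genome)
```

With this approach, an absolute path supplied by the user is passed to bwa mem unchanged, and a missing root raises an error instead of silently becoming the string "None" in the command line.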

Thanks! L.