Open wyim-pgl opened 3 years ago
Hi @sc-zhang , could you please help our friend Won out? Thanks!
Dear wyim-pgl and Xingtan, I also met the same erro when I use partition_gmap.py to separate chromosome group for fasta and BAM, 2 chromosomes didn't have any bam file but fasta file. I'm wondering how to solve this problem. I hope you can help me to solve this problem.
@lilyynu please check the chromosomes are appeared in the allele_table, and the name of chromosome not start with "tig", "scaffold", "utg", "ctg". If so, rename it or modify the partition_gmap.py with line 41, remove the prefix which you do not need filter.
if chrn.startswith('tig') or chrn.startswith('scaffold') or chrn.startswith('utg') or chrn.startswith('ctg'):
@sc-zhang I have checked my Allele.ctg.table, and the name of chromosome start with "ptg". Finnaly, I have modify the partition_gmap.py at line 41 like " if chrn.startswith('tig') or chrn.startswith('scaffold') or chrn.startswith('utg') or chrn.startswith('ctg') or chrn.startswith('ptg'):", but it not works. I also try to change all "ctg" to "ptg" in partition_gmap.py, it also not works, the chrom5 and chrom8 still only contain no bam file only fasta file retained.
Before partition, I sort the prunning.bam file with "samtools sort -@ 30 prunning.bam -o prunning.sort.bam", I have no idea how to solve this erro.
I generated the Allele.ctg.table by the method of GMAP, and my species is a diploid.
The form of Allele.ctg.table like this:
Hic_asm_8 7556 ptg000020l_np1212 ptg000039l_np1212
Hic_asm_8 8462 ptg000037l_np1212
Hic_asm_8 26493 ptg000108l_np1212
Hic_asm_8 41792 ptg000108l_np1212
Hic_asm_8 42651 ptg000108l_np1212
Hic_asm_8 82682 ptg000108l_np1212
Hic_asm_8 88782 ptg000108l_np1212
Hic_asm_8 93434 ptg000108l_np1212
The sorted prunning.bam looks like this:
(reseq) [lili@localhost GMAP]$ samtools view prunning.sort.bam | head | cut -f 1,2,3,4,5,6,7,8,9
A00821:973:HYHYWDSX2:1:2114:21829:27806 161 ptg000011l_np1212 6754 37 150M ptg000324l_np1212 23895 0
A00920:929:H5YTKDSX2:3:2528:26964:31344 163 ptg000011l_np1212 13629 37 150M = 13878 399
A00920:929:H5YTKDSX2:3:1449:17806:24909 81 ptg000011l_np1212 13662 37 150M ptg000283l_np1212 186606 0
A00920:929:H5YTKDSX2:3:1262:1009:20369 65 ptg000011l_np1212 13666 37 150M ptg000006l_np1212 261558 0
A00920:929:H5YTKDSX2:3:1637:6903:33489 163 ptg000011l_np1212 21150 37 150M = 21305 305
A00920:929:H5YTKDSX2:3:1531:22978:23030 161 ptg000011l_np1212 21168 37 150M ptg000294l_np1212 11928 0
A00920:929:H5YTKDSX2:3:1355:5629:1470 161 ptg000011l_np1212 21188 37 150M ptg000244l_np1212 85530 0
A00920:929:H5YTKDSX2:3:1476:5330:35023 65 ptg000011l_np1212 35013 37 150M ptg000206l_np1212 4078 0
A00920:929:H5YTKDSX2:3:1552:8892:19523 65 ptg000011l_np1212 35038 37 150M ptg000206l_np1212 4098 0
A00920:929:H5YTKDSX2:3:1652:28926:18208 97 ptg000011l_np1212 35038 37 150M ptg000269l_np1212 22770 0
I'm looking forward that you can help me to solve this erro. Best regards!
@lilyynu Sorry for my unclear answer, I mean the first column of allele table is the name of chromosomes and if they start with prefix among "tig", "scaffold", "utg", "ctg", it will be skipped, so you can remove the prefix in line 47 of partition_gmap.py that make it can load records start with "tig", "scaffold", "utg" or "ctg". But as the allele table which you paste here, the reason of why the bam file of chromosome 5 and 8 cannot be generated is not what I thought before. So, please restore the partition_gmap.py to the origin one, and write the codes below to "check_bam.py", and use "python check_bam.py chrom8.fasta prunning.sort.bam" and check if there is any pair of contigs print.
#!/usr/bin/env python
import sys
import pysam
def check_bam(in_fa, in_bam):
id_set = set()
with open(in_fa, 'r') as fin:
for line in fin:
if line[0] == '>':
id = line.strip().split()[0][1:]
id_set.add(id)
with pysam.AlignmentFile(in_bam, 'rb') as fin:
for ctg in id_set:
for line in fin.fetch(contig=ctg):
if line.next_reference_name and line.next_reference_name in id_set:
print(ctg, line.next_reference_name)
if __name__ == "__main__":
if len(sys.argv) < 3:
print("Usage: python %s <in_fa> <in_bam>"%sys.argv[0])
else:
in_fa, in_bam = sys.argv[1:]
check_bam(in_fa, in_bam)
@sc-zhang Thanks for your fast reply! I changed the script "id_set = {}" at line 2 to "id_set = set()", then the script works. Finnally, I got the result below: (python27) [lili@localhost GMAP]$ python check_bam.py Hic_asm_8.fa prunning.sort.bam ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') Is the result mean pair of contigs was printed? How could I solve this problem?
@sc-zhang Thanks for your fast reply! I changed the script "id_set = {}" at line 2 to "id_set = set()", then the script works. Finnally, I got the result below: (python27) [lili@localhost GMAP]$ python check_bam.py Hic_asm_8.fa prunning.sort.bam ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') Is the result mean pair of contigs was printed? How could I solve this problem?
@lilyynu maybe you can write the code below to extract.py and run it with command "python extract.py chrom.fasta prunned.bam out.bam", and check if the output bam is empty or not
#!/usr/bin/env python
import sys
import pysam
def extract_bam(in_fa, in_bam, out_bam):
id_set = set()
with open(in_fa, 'r') as fin:
for line in fin:
if line[0] == '>':
id = line.strip().split()[0][1:]
id_set.add(id)
with pysam.AlignmentFile(in_bam, 'rb') as fin:
with pysam.AlignmentFile(out_bam, 'wb', template=fin) as fout:
for ctg in id_set:
for line in fin.fetch(contig=ctg):
if line.next_reference_name and line.next_reference_name in id_set:
fout.write(line)
if __name__ == "__main__":
if len(sys.argv) < 4:
print("Usage: python %s <in_fa> <in_bam> <out_bam>"%sys.argv[0])
else:
in_fa, in_bam, out_bam = sys.argv[1:]
extract_bam(in_fa, in_bam, out_bam)
@sc-zhang I change the "sub_bam" to "out_bam" in your code at Line 14, and the code works. finnaly, I got the bam file of the chrom5 and chrom8. Thanks for your help. Best regards!
I also met the problem today.
The script partition_gmap.py
implements a multi-processing approach to separate homologous groups with the module multiprocessing
. However, it does not catch any exceptions raised by sub-processes. Consequently, if an error occurs in a sub-process (e.g., a sequence in the FASTA file does not match the @SQ
lines in the BAM file), the program may output an incomplete or empty BAM file without an explicit error report.
I‘ll fix the problem soon.
Dear Xingtan, When I use partition_gmap.py to separate chromosome group for fasta and BAM, some chromosomes didn't have any bam file but fasta file. Do you have any ideas for it or anyway to check it quickly? Thanks.