tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data
174 stars 39 forks source link

partition_gmap.py without bam but with fasta file #100

Open wyim-pgl opened 3 years ago

wyim-pgl commented 3 years ago

Dear Xingtan, When I use partition_gmap.py to separate chromosome group for fasta and BAM, some chromosomes didn't have any bam file but fasta file. Do you have any ideas for it or anyway to check it quickly? Thanks.

tangerzhang commented 3 years ago

Hi @sc-zhang , could you please help our friend Won out? Thanks!

lilyynu commented 2 years ago

Dear wyim-pgl and Xingtan, I also met the same erro when I use partition_gmap.py to separate chromosome group for fasta and BAM, 2 chromosomes didn't have any bam file but fasta file. I'm wondering how to solve this problem. I hope you can help me to solve this problem.

sc-zhang commented 2 years ago

@lilyynu please check the chromosomes are appeared in the allele_table, and the name of chromosome not start with "tig", "scaffold", "utg", "ctg". If so, rename it or modify the partition_gmap.py with line 41, remove the prefix which you do not need filter. if chrn.startswith('tig') or chrn.startswith('scaffold') or chrn.startswith('utg') or chrn.startswith('ctg'):

lilyynu commented 2 years ago

@sc-zhang I have checked my Allele.ctg.table, and the name of chromosome start with "ptg". Finnaly, I have modify the partition_gmap.py at line 41 like " if chrn.startswith('tig') or chrn.startswith('scaffold') or chrn.startswith('utg') or chrn.startswith('ctg') or chrn.startswith('ptg'):", but it not works. I also try to change all "ctg" to "ptg" in partition_gmap.py, it also not works, the chrom5 and chrom8 still only contain no bam file only fasta file retained. Before partition, I sort the prunning.bam file with "samtools sort -@ 30 prunning.bam -o prunning.sort.bam", I have no idea how to solve this erro. I generated the Allele.ctg.table by the method of GMAP, and my species is a diploid. The form of Allele.ctg.table like this: Hic_asm_8 7556 ptg000020l_np1212 ptg000039l_np1212
Hic_asm_8 8462 ptg000037l_np1212
Hic_asm_8 26493 ptg000108l_np1212
Hic_asm_8 41792 ptg000108l_np1212
Hic_asm_8 42651 ptg000108l_np1212
Hic_asm_8 82682 ptg000108l_np1212
Hic_asm_8 88782 ptg000108l_np1212
Hic_asm_8 93434 ptg000108l_np1212 The sorted prunning.bam looks like this: (reseq) [lili@localhost GMAP]$ samtools view prunning.sort.bam | head | cut -f 1,2,3,4,5,6,7,8,9 A00821:973:HYHYWDSX2:1:2114:21829:27806 161 ptg000011l_np1212 6754 37 150M ptg000324l_np1212 23895 0 A00920:929:H5YTKDSX2:3:2528:26964:31344 163 ptg000011l_np1212 13629 37 150M = 13878 399 A00920:929:H5YTKDSX2:3:1449:17806:24909 81 ptg000011l_np1212 13662 37 150M ptg000283l_np1212 186606 0 A00920:929:H5YTKDSX2:3:1262:1009:20369 65 ptg000011l_np1212 13666 37 150M ptg000006l_np1212 261558 0 A00920:929:H5YTKDSX2:3:1637:6903:33489 163 ptg000011l_np1212 21150 37 150M = 21305 305 A00920:929:H5YTKDSX2:3:1531:22978:23030 161 ptg000011l_np1212 21168 37 150M ptg000294l_np1212 11928 0 A00920:929:H5YTKDSX2:3:1355:5629:1470 161 ptg000011l_np1212 21188 37 150M ptg000244l_np1212 85530 0 A00920:929:H5YTKDSX2:3:1476:5330:35023 65 ptg000011l_np1212 35013 37 150M ptg000206l_np1212 4078 0 A00920:929:H5YTKDSX2:3:1552:8892:19523 65 ptg000011l_np1212 35038 37 150M ptg000206l_np1212 4098 0 A00920:929:H5YTKDSX2:3:1652:28926:18208 97 ptg000011l_np1212 35038 37 150M ptg000269l_np1212 22770 0

I'm looking forward that you can help me to solve this erro. Best regards!

sc-zhang commented 2 years ago

@lilyynu Sorry for my unclear answer, I mean the first column of allele table is the name of chromosomes and if they start with prefix among "tig", "scaffold", "utg", "ctg", it will be skipped, so you can remove the prefix in line 47 of partition_gmap.py that make it can load records start with "tig", "scaffold", "utg" or "ctg". But as the allele table which you paste here, the reason of why the bam file of chromosome 5 and 8 cannot be generated is not what I thought before. So, please restore the partition_gmap.py to the origin one, and write the codes below to "check_bam.py", and use "python check_bam.py chrom8.fasta prunning.sort.bam" and check if there is any pair of contigs print.

#!/usr/bin/env python
import sys
import pysam

def check_bam(in_fa, in_bam):
    id_set = set()
    with open(in_fa, 'r') as fin:
        for line in fin:
            if line[0] == '>':
                id = line.strip().split()[0][1:]
                id_set.add(id)

    with pysam.AlignmentFile(in_bam, 'rb') as fin:
        for ctg in id_set:
            for line in fin.fetch(contig=ctg):
                if line.next_reference_name and line.next_reference_name in id_set:
                    print(ctg, line.next_reference_name)

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python %s <in_fa> <in_bam>"%sys.argv[0])
    else:
        in_fa, in_bam = sys.argv[1:]
        check_bam(in_fa, in_bam)
lilyynu commented 2 years ago

@sc-zhang Thanks for your fast reply! I changed the script "id_set = {}" at line 2 to "id_set = set()", then the script works. Finnally, I got the result below: (python27) [lili@localhost GMAP]$ python check_bam.py Hic_asm_8.fa prunning.sort.bam ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') Is the result mean pair of contigs was printed? How could I solve this problem?

lilyynu commented 2 years ago

@sc-zhang Thanks for your fast reply! I changed the script "id_set = {}" at line 2 to "id_set = set()", then the script works. Finnally, I got the result below: (python27) [lili@localhost GMAP]$ python check_bam.py Hic_asm_8.fa prunning.sort.bam ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') ('ptg000224l_np1212', 'ptg000224l_np1212') Is the result mean pair of contigs was printed? How could I solve this problem?

sc-zhang commented 2 years ago

@lilyynu maybe you can write the code below to extract.py and run it with command "python extract.py chrom.fasta prunned.bam out.bam", and check if the output bam is empty or not

#!/usr/bin/env python
import sys
import pysam

def extract_bam(in_fa, in_bam, out_bam):
    id_set = set()
    with open(in_fa, 'r') as fin:
        for line in fin:
            if line[0] == '>':
                id = line.strip().split()[0][1:]
                id_set.add(id)

    with pysam.AlignmentFile(in_bam, 'rb') as fin:
        with pysam.AlignmentFile(out_bam, 'wb', template=fin) as fout:
            for ctg in id_set:
                for line in fin.fetch(contig=ctg):
                    if line.next_reference_name and line.next_reference_name in id_set:
                        fout.write(line)

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python %s <in_fa> <in_bam> <out_bam>"%sys.argv[0])
    else:
        in_fa, in_bam, out_bam = sys.argv[1:]
        extract_bam(in_fa, in_bam, out_bam)
lilyynu commented 2 years ago

@sc-zhang I change the "sub_bam" to "out_bam" in your code at Line 14, and the code works. finnaly, I got the bam file of the chrom5 and chrom8. Thanks for your help. Best regards!

zengxiaofei commented 1 year ago

I also met the problem today. The script partition_gmap.py implements a multi-processing approach to separate homologous groups with the module multiprocessing. However, it does not catch any exceptions raised by sub-processes. Consequently, if an error occurs in a sub-process (e.g., a sequence in the FASTA file does not match the @SQ lines in the BAM file), the program may output an incomplete or empty BAM file without an explicit error report. I‘ll fix the problem soon.