shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.29k stars 157 forks source link

Trim a multiple sequence alignment based on a BED #477

Open mbhall88 opened 3 months ago

mbhall88 commented 3 months ago

As always, I assume there is a way to do this seqkit, but my fu is not strong enough to find it.

I have a BED file of positions/columns I would like to keep from a multiple sequence alignment.

I've tried seqkit subset -w 0 -U --bed keep.bed msa.fa > msa.trim.fa but I get the warning [WARN] sequence () not found in file: msa.fa and the resulting output is empty. I assume this is because the chromosome name in the BED does not match the names of the sequences in the MSA.

Esentially I want to ignore the chromosome name and just extract the positions in the BED from each sequence in the MSA.

shenwei356 commented 3 months ago

How about trim via a region -r ?

mbhall88 commented 3 months ago

But the region option only takes a single region. I’d like to do multiple regions at once. maybe a flag to ignore chromosome name in a bed, or pass a regions file?

shenwei356 commented 2 months ago

Calling seqkit subseq -r multiple times.

Even if using the bed with chr, the results would be strange.

mbhall88 commented 2 months ago

if I have thousands of regions though this would be quite cumbersome no?

For example, this is a small python script I used to achieve the functionality I am asking about

import sys

def trim_sequence(sequence: str, positions: list[range]) -> str:
    return "".join([sequence[iv] for iv in positions])

def main():
    bed_file = sys.argv[1]
    msa_file = sys.argv[2]
    # Read the bed file and extract the positions
    positions = set()
    with open(bed_file, "r") as bed:
        for line in bed:
            if not line.strip():
                continue

            fields = line.strip().split("\t")
            start, end = int(fields[1]), int(fields[2])
            positions.update(range(start, end))

    # Trim each sequence in the MSA
    with open(msa_file, "r") as msa:
        seq = ""
        for line in msa:
            if not line.strip():
                continue

            if line.startswith(">"):
                if seq:
                    trimmed_sequence = trim_sequence(seq, positions)
                    print(trimmed_sequence, file=sys.stdout)
                seq = ""
                sys.stdout.write(line)
            else:
                seq += line.strip()

        if seq:
            trimmed_sequence = trim_sequence(seq, positions)
            print(trimmed_sequence, file=sys.stdout)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python trim_msa.py <bed_file> <msa_file>", file=sys.stderr)
        sys.exit(1)
    main()