Open mbhall88 opened 3 months ago
How about trim via a region -r ?
But the region option only takes a single region. I’d like to do multiple regions at once. maybe a flag to ignore chromosome name in a bed, or pass a regions file?
Calling seqkit subseq -r
multiple times.
Even if using the bed with chr, the results would be strange.
if I have thousands of regions though this would be quite cumbersome no?
For example, this is a small python script I used to achieve the functionality I am asking about
import sys
def trim_sequence(sequence: str, positions: list[range]) -> str:
return "".join([sequence[iv] for iv in positions])
def main():
bed_file = sys.argv[1]
msa_file = sys.argv[2]
# Read the bed file and extract the positions
positions = set()
with open(bed_file, "r") as bed:
for line in bed:
if not line.strip():
continue
fields = line.strip().split("\t")
start, end = int(fields[1]), int(fields[2])
positions.update(range(start, end))
# Trim each sequence in the MSA
with open(msa_file, "r") as msa:
seq = ""
for line in msa:
if not line.strip():
continue
if line.startswith(">"):
if seq:
trimmed_sequence = trim_sequence(seq, positions)
print(trimmed_sequence, file=sys.stdout)
seq = ""
sys.stdout.write(line)
else:
seq += line.strip()
if seq:
trimmed_sequence = trim_sequence(seq, positions)
print(trimmed_sequence, file=sys.stdout)
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python trim_msa.py <bed_file> <msa_file>", file=sys.stderr)
sys.exit(1)
main()
As always, I assume there is a way to do this seqkit, but my fu is not strong enough to find it.
I have a BED file of positions/columns I would like to keep from a multiple sequence alignment.
I've tried
seqkit subset -w 0 -U --bed keep.bed msa.fa > msa.trim.fa
but I get the warning[WARN] sequence () not found in file: msa.fa
and the resulting output is empty. I assume this is because the chromosome name in the BED does not match the names of the sequences in the MSA.Esentially I want to ignore the chromosome name and just extract the positions in the BED from each sequence in the MSA.