shandley / hecatomb

hecatomb is a virome analysis pipeline for analysis of Illumina sequence data
MIT License
54 stars 12 forks source link

Adding an ID blacklist file #111

Open linsalrob opened 3 months ago

linsalrob commented 3 months ago

I have added a draft blacklist file, but I am not 100% convinced this is the right format or the right location. However, I wanted to ensure that we record the work I am doing in checking these sequences, and this seemed a good starting point!

The blacklist file includes sequences in our database (currently all in the virus_secondary_aa/sequenceDB database). You can extract that database to .fasta:

mmseqs convert2fasta databases/aa/virus_secondary_aa/sequenceDB secondary_aa.faa

and then find these sequences using grep. I recommend;

grep --no-group-separator -A1 G0W2I5

This will list all versions of the sequence with ID G0W2I5 in the file, in fasta format, and then you can paste them into the NCBI blast website and search them all in one go.

I have used YAML format for this blacklist file. I chose YAML because it is simple, well supported, and many of the other config files in hecatomb use YAML format.

The data is called blacklist_ids, and you can read it in python so:

import yaml
with open("blacklist_ids.yaml", 'r') as file:
    data = yaml.safe_load(file)
for id in data['blacklist_ids']:
    print((id)

or in R like so:

library(yaml)
data <- yaml.load_file('blacklist_ids.yaml')
print(data)

Entries in the blacklist_ids have the following attributes:

  1. each entry is identified by the sequence ID
  2. there is an id field that also contains the sequence ID. This is redundant, and maybe we should drop it.
  3. the header field provides the whole header from the mmseqs database for this entry. Note that some IDs map to more than one entry, and only one of them maybe incorrect. Currently, we can't map from result ID to entry, but this information will likely help cleaning up the data later.
  4. The reason field is a note about why this sequence has been blacklisted.

Feel free to nominate other fields that should be added, because I envision that this resource will continue to be used, and so I'd like to future-proof it.

beardymcjohnface commented 3 months ago

Ideally in the future this is something we handle with new releases of the database, but this list will be a great sanity check for a high-throughput method for filtering the databases.