karel-brinda / Phylign

Alignment against all pre-2019 bacteria on laptops within a few hours (former MOF-Search)
http://brinda.eu/mof

New command 'make label': assign a label to a query draft assembly based on the best hits from COBS #263

Open jorgeavilacartes opened 6 months ago

jorgeavilacartes commented 6 months ago

Hello,

In this pull request, I included:

  1. A modification of the Snakefile to support a larger number of input files. Why? With ~400 input files, the code crashed because the concatenation of their names was too long. So I modified the get_filename_for_all_queries() function to return a fixed string. See here
  2. "fna" was added to the list of accepted extensions, since this is the default format of assemblies downloaded from NCBI (with ncbi-datasets).
  3. Scripts and files to assign a label to a query draft assembly at the species level.
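To illustrate item 1, here is a minimal sketch of why a concatenated name breaks with many inputs and how a fixed string avoids it. It is not the code from the Snakefile: the helper names, the `___` separator, and the constant `all_queries` are hypothetical, and the 255-byte limit is the typical per-file-name limit on common filesystems rather than something measured in this PR.

```python
import os

# Typical maximum length of a single file name on common filesystems (assumption).
MAX_FILENAME_BYTES = 255


def concatenated_name(query_files):
    """Hypothetical old-style naming: join the stems of all query files."""
    stems = [os.path.basename(f).split(".")[0] for f in query_files]
    return "___".join(sorted(stems))  # separator is an assumption


def fixed_name(query_files):
    """Naming as described in this PR: a constant, independent of the inputs."""
    return "all_queries"  # hypothetical constant, not necessarily the PR's string


# 400 made-up query assemblies, mimicking the scale mentioned above.
queries = [f"queries/assembly_{i:04d}.fna" for i in range(400)]

for name in (concatenated_name(queries), fixed_name(queries)):
    ok = len(name.encode()) <= MAX_FILENAME_BYTES
    print(f"{len(name):>5} characters, usable as a file name: {ok}")
```

The trade-off of a fixed name is that two runs with different query sets write to the same output path, so results from an earlier run need to be moved or cleaned up first.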

How are labels assigned to a query draft assembly? Each contig in a draft assembly is treated as a separate query, so I parse the output file from intermediate/04_filter and collect all hits of each assembly (i.e. the union of the hits of its contigs).
Each hit (represented by the sampleID of a database assembly) is mapped to its label, and the query assembly is assigned the most common label among its hits.
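The majority vote described above can be sketched as follows. This is an illustration under assumptions, not the script added in this PR: the function name, the in-memory input structures, the sample IDs, and the handling of unlabeled hits and ties are all hypothetical, and parsing of the intermediate/04_filter output is left out.

```python
from collections import Counter, defaultdict


def label_assemblies(hits, contig_to_assembly, sample_to_label):
    """Assign each query assembly the most common label among its contigs' hits.

    hits: iterable of (contig_id, hit_sample_id) pairs
    contig_to_assembly: dict mapping contig ID -> query assembly name
    sample_to_label: dict mapping hit sampleID -> species label
    """
    votes = defaultdict(Counter)
    for contig_id, sample_id in hits:
        label = sample_to_label.get(sample_id)
        if label is None:
            continue  # assumption: hits without a known label are skipped
        votes[contig_to_assembly[contig_id]][label] += 1
    # most_common(1) picks the top label; ties go to the label counted first.
    return {asm: counts.most_common(1)[0][0] for asm, counts in votes.items()}


# Made-up example: two query assemblies, three contigs, four hits.
hits = [("asmA_c1", "S1"), ("asmA_c1", "S2"), ("asmA_c2", "S3"), ("asmB_c1", "S4")]
contig_to_assembly = {"asmA_c1": "asmA", "asmA_c2": "asmA", "asmB_c1": "asmB"}
sample_to_label = {
    "S1": "Escherichia coli",
    "S2": "Escherichia coli",
    "S3": "Salmonella enterica",
    "S4": "Klebsiella pneumoniae",
}

labels = label_assemblies(hits, contig_to_assembly, sample_to_label)
print(labels)
```

Note that every hit counts as one vote here, so an assembly with many contigs hitting the same species outweighs a single strong hit elsewhere; weighting by hit quality would be a different design.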

The labels correspond to the second column (most abundant species) of the Kraken/Bracken file that was used to create the clusters. The file data/labels_krakenbracken_by_sampleid.txt was added to the repository.

NOTE: these modifications do not interfere with the main pipeline, since the labeling step is run separately, after make match. See the updated README.