joyhughes / libqsearch-clean

cleaner C++ port of libqsearch

Smart sequence file auto-cleaning #8

Open joyhughes opened 1 month ago

joyhughes commented 1 month ago

Right now (as of PR #5), when a sequence is loaded via drag and drop, the first line is ignored, which matches the FASTA format, where the first line is a description header. This will not work for all formats. The default should probably be to compare the whole file, with auto-cleaning applied only when we know the format.
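A rough sketch of what that default could look like, assuming C++17 and assuming format detection is done by checking for the leading `>` that starts a FASTA description line. The names `looks_like_fasta` and `comparison_payload` are hypothetical and not part of the current codebase:

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: a FASTA record begins with a '>' description line.
// This is only a heuristic; other formats would need their own checks.
bool looks_like_fasta(std::string_view data) {
    return !data.empty() && data.front() == '>';
}

// Hypothetical helper: returns the bytes that should be compared.
// Unknown formats are passed through untouched (the proposed default).
std::string comparison_payload(std::string_view data) {
    if (!looks_like_fasta(data)) {
        return std::string(data);  // unknown format: compare the whole file
    }
    const auto eol = data.find('\n');
    if (eol == std::string_view::npos) {
        return std::string();  // header line only, no sequence data
    }
    return std::string(data.substr(eol + 1));  // drop the description line
}
```

With this shape, a plain text file dropped into the UI would be compared as-is, while a file starting with `>` would lose only its header line.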

joyhughes commented 1 month ago

From @rudi-cilibrasi on Discord: some small notes on FASTA, specifically the full mitochondrial genomes we are fetching from GenBank:

  1. we already strip off the first line. this is good. we also need to
  2. convert everything to lowercase and
  3. remove all characters that are not in the set {a,c,g,t}
  4. throw away any sequences that are < 10k or > 20k in size after these transformations. the reason for step 4 is that some sequences are misfiled and uncorrected in GenBank: they are filed as full genome but are not actually full mito. so it is a little runtime data cleaning after the fetch
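The four steps above could be sketched roughly as follows, assuming C++17. The function name `clean_fasta_sequence` and the constants are hypothetical, and "10k" and "20k" are assumed here to mean 10,000 and 20,000 characters after cleaning:

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Assumed size limits (characters remaining after cleaning).
constexpr std::size_t kMinLength = 10'000;
constexpr std::size_t kMaxLength = 20'000;

// Hypothetical helper: returns the cleaned sequence, or std::nullopt if
// the record should be thrown away.
std::optional<std::string> clean_fasta_sequence(std::string_view raw) {
    // Step 1: strip off the first (description) line.
    const auto eol = raw.find('\n');
    if (eol == std::string_view::npos) {
        return std::nullopt;  // no sequence data after the header
    }
    raw.remove_prefix(eol + 1);

    // Steps 2 and 3: lowercase, then keep only a, c, g, t.
    std::string cleaned;
    cleaned.reserve(raw.size());
    for (unsigned char ch : raw) {
        const char lower = static_cast<char>(std::tolower(ch));
        if (lower == 'a' || lower == 'c' || lower == 'g' || lower == 't') {
            cleaned.push_back(lower);
        }
    }

    // Step 4: discard sequences that are < 10k or > 20k after cleaning.
    if (cleaned.size() < kMinLength || cleaned.size() > kMaxLength) {
        return std::nullopt;
    }
    return cleaned;
}
```

Returning `std::optional` lets the caller skip misfiled records without treating them as errors. Newlines and ambiguity codes such as `n` are dropped by the same filter, so no separate line-joining pass is needed.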