kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Index file detection in amalgkit sanity #72

Closed kfuku52 closed 2 years ago

kfuku52 commented 3 years ago
Looking for Index file ./Index/Arabidopsis_thaliana* for species:  Arabidopsis thaliana
Found  ['./Index/Arabidopsis_thaliana_Athaliana-167-cds-primaryTranscriptOnly.fa'] !

I'm not sure what sanity does here. It says that the index is found and shows the path to the reference fasta I put in amalgkit_out/Index. Should the fasta be in another directory?

Also, help messages are not user-friendly. In many cases, "Index" does not make sense if you don't mention the "reference fasta".

Minor point: Does the default "Index" dir have to be in title case even though none of the other subdirs have a capitalized first letter?

Hego-CCTB commented 3 years ago

Yeah, this functionality needs some love, I agree. The way it works is that it looks for an Index folder (has to be capitalized) in the working directory and tries to find any file that starts with the Genus_species prefix. But that's where it stops. The assumption is that there should only be index files in the Index directory, so they should already have been run through kallisto index or something comparable.
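To make the lookup described above concrete, here is a minimal Python sketch of a prefix-based search. It is not the actual amalgkit code: the function name `find_index_candidates` and its arguments are hypothetical, and the only behavior taken from this thread is "look in `Index` under the working directory for files starting with the `Genus_species` prefix".

```python
import glob
import os


def find_index_candidates(work_dir, scientific_name):
    """Return files in <work_dir>/Index whose names start with Genus_species.

    Hypothetical helper for illustration; it performs no check that the
    matched files are real index files, which mirrors the limitation
    discussed in this issue.
    """
    # "Arabidopsis thaliana" -> "Arabidopsis_thaliana"
    prefix = scientific_name.strip().replace(" ", "_")
    # The directory name is case-sensitive here: it must be "Index".
    pattern = os.path.join(work_dir, "Index", prefix + "*")
    return sorted(glob.glob(pattern))
```

With this kind of matching, a reference fasta placed in `Index` is reported as "found" exactly like a real index, which is the behavior seen in the log at the top of this issue.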

At some point we obsoleted an index builder through amalgkit, so that's something the user has to do on their own before running quant.

I'll make the messages clearer.

I wonder if there is a way to verify that something is an actual index file and not a random text or fasta file that found its way into the Index folder.
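One cheap heuristic, sketched below in Python, would be to reject files that look like plain text or fasta rather than trying to validate the index format itself. This is only an assumption about what might be good enough: the function name `is_probably_not_an_index` is hypothetical, the check does not parse the kallisto index format, and a compressed fasta would still pass because it is binary.

```python
def is_probably_not_an_index(path, sample_size=4096):
    """Heuristic: True if the file is empty, fasta-like, or plain ASCII text.

    Hypothetical helper; it only inspects the first bytes of the file and
    does not confirm that the file is a valid index.
    """
    with open(path, "rb") as handle:
        chunk = handle.read(sample_size)
    if not chunk:
        return True  # an empty file cannot be a usable index
    if b"\x00" in chunk:
        return False  # NUL bytes suggest binary content
    if chunk.lstrip().startswith(b">"):
        return True  # fasta records start with a ">" header line
    try:
        chunk.decode("ascii")
    except UnicodeDecodeError:
        return False  # non-ASCII bytes, so probably binary
    return True  # decodes cleanly as ASCII text
```

A check like this could turn the current "Found ..." message into a warning when the matched file is obviously a fasta.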

kfuku52 commented 3 years ago

index builder

I missed that functionality, and probably we should keep it rather than obsoleting it. Could you add a wiki page for the usage? I'll give it a try and send you my feedback.

Hego-CCTB commented 3 years ago

We obsoleted it because all it did was call kallisto index. There was no automatic index building in place at the time, and there was no functional difference between running kallisto index and amalgkit quant --index_build.

That was long before amalgkit started using metadata for everything. Now that metadata is a required input anyway, an automatic index builder is possible.

I'll look into this!

kfuku52 commented 3 years ago

OK, please fix/update amalgkit for this issue if needed and write instructions for indexing.

Hego-CCTB commented 2 years ago

Bump for myself!

Hego-CCTB commented 2 years ago

I have reintroduced index building to quant. It works as follows:

--fasta_dir PATH: folder containing fasta files named after the species; required for building an index
--build_index yes|no: enables the functionality

quant runs as normal. If --build_index yes, quant first looks in the index directory for an index matching the species currently being processed according to the metadata file. If there is no available index for that species, quant tries to find a matching fasta file in --fasta_dir and builds the index. quant then immediately goes on to quantify samples as normal.
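The flow described above can be sketched roughly as follows. This is an illustration, not the code from the commits linked below: `ensure_index`, its parameters, the `.idx` suffix, and taking the first fasta match are all assumptions. The only external call is `kallisto index -i <index> <fasta>`, and the `run` parameter is there so the command can be swapped out for testing.

```python
import glob
import os
import subprocess


def ensure_index(scientific_name, index_dir, fasta_dir,
                 build_index=False, run=subprocess.run):
    """Return an index path for the species, building one if allowed.

    Hypothetical sketch: reuse an existing index if one matches the
    Genus_species prefix, otherwise build from a matching fasta.
    """
    prefix = scientific_name.strip().replace(" ", "_")
    # 1. Look for an existing index in the index directory first.
    existing = sorted(glob.glob(os.path.join(index_dir, prefix + "*")))
    if existing:
        return existing[0]
    if not build_index:
        raise FileNotFoundError("No index found for " + scientific_name)
    # 2. No index: look for a matching fasta in the fasta directory.
    fastas = sorted(glob.glob(os.path.join(fasta_dir, prefix + "*")))
    if not fastas:
        raise FileNotFoundError("No fasta found for " + scientific_name)
    # 3. Build the index (output file name is an assumption).
    os.makedirs(index_dir, exist_ok=True)
    index_path = os.path.join(index_dir, prefix + ".idx")
    run(["kallisto", "index", "-i", index_path, fastas[0]], check=True)
    return index_path
```

Keeping the fasta files in a separate directory from the indices avoids the ambiguity raised at the start of this issue, where a fasta in the index directory was reported as a found index.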

https://github.com/kfuku52/amalgkit/commit/6f7971b83bb6be07b2c7de9922fcf521c13e94ca and https://github.com/kfuku52/amalgkit/commit/23c824c750dd4bcc540fc2a68144bf7faa62449d (safer fasta file detection)