kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

minimal test dataset #41

Open kfuku52 opened 3 years ago

kfuku52 commented 3 years ago

To make testing easier and quicker, we should find a minimal dataset, and run most, if not all, functionalities from metadata to curate with it. Ideally,

Hego-CCTB commented 3 years ago

One thing of note here is that curate currently expects tissues as input. While condition or strain would be functionally equivalent to tissue, we'll need to adjust curate to look at different columns if prompted to do so.

kfuku52 commented 3 years ago

Good point. How about adding a new column such as curate_group in the metadata table? tissue can be copied as default values but users can manually modify it to include other categories such as treatment, sex, genotype...whatever they want.

kfuku52 commented 3 years ago

... and, of course, amalgkit curate uses curate_group instead of tissue.

Hego-CCTB commented 3 years ago

Adding curate_group should probably be done all the way back in amalgkit metadata. Also definitely needs an explanation in the wiki.

kfuku52 commented 3 years ago

Right, could you do it?

Hego-CCTB commented 3 years ago

On it right now, should be a quick adjustment! I'll probably add the curate_group column during or after the group_tissues_by_config call and just copy the tissue column over.
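A minimal sketch of the "copy tissue over" idea, using only the standard library. This is not the actual amalgkit code: the function name `add_curate_group`, the `run` column, and the SRR identifiers are placeholders for illustration, and the choice to preserve existing non-empty values is an assumption.

```python
def add_curate_group(rows):
    """Return copies of the metadata rows with a curate_group column.

    Rows that already carry a non-empty curate_group keep it, so manual
    edits (treatment, sex, genotype, ...) are not overwritten.
    Otherwise the tissue value is copied as the default.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's rows
        if not new_row.get("curate_group"):
            new_row["curate_group"] = new_row.get("tissue", "")
        out.append(new_row)
    return out


# Placeholder metadata rows (hypothetical run IDs and column names).
metadata = [
    {"run": "SRR0000001", "tissue": "root"},
    {"run": "SRR0000002", "tissue": "leaf", "curate_group": "heat stressed"},
]

for row in add_curate_group(metadata):
    print(row["run"], row["curate_group"])
```

Keeping user-edited values intact means the metadata step could be re-run without losing manual changes, if that behavior is wanted.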

Hego-CCTB commented 3 years ago

Amalgkit version 0.5.1.0:

kfuku52 commented 3 years ago

Thanks. Please describe the default behavior of --curate_group. I assume all values will be included by default.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB commented 3 years ago

amalgkit curate --curate_group is identical to how amalgkit curate --tissue worked. It was just renamed to avoid confusion. A typical command would look like: amalgkit curate --curate_group "root,flower,leaf" [additional arguments]

Within the R script, this input string is split and read into a vector, selected_tissues, which then gets passed to the main algorithm, for example to check_whithin_tissue_correlation.

EDIT: to add to this, --curate_group (like --tissues) does not have a default input, but is required. Theoretically, it would be possible to read selected tissues/conditions/whatever from this column as default, but this can cause all kinds of problems, especially when the metadata sheet contains data from multiple species, or has typos in the column, unused SRR entries, etc.
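To illustrate the splitting step and why an explicit list is safer than reading the column blindly, here is a rough Python sketch. The real splitting happens in the R script; the helper names `parse_curate_group` and `unknown_groups` are hypothetical, and trimming whitespace around each value is an assumption rather than documented behavior.

```python
def parse_curate_group(arg):
    """Split a string like 'root,flower,leaf' into a list of group names."""
    groups = [g.strip() for g in arg.split(",") if g.strip()]
    if not groups:
        raise ValueError("--curate_group needs at least one value")
    return groups


def unknown_groups(selected, metadata_values):
    """Return the selected groups that never occur in the metadata column."""
    available = set(metadata_values)
    return [g for g in selected if g not in available]


selected = parse_curate_group("root,flower,leaf")

# Hypothetical curate_group column that contains a typo ("flwer").
column = ["root", "root", "leaf", "flwer"]

print(selected)
print(unknown_groups(selected, column))
```

With an explicit list, a typo in the metadata shows up as a requested group with no matching samples instead of silently becoming its own group.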

kfuku52 commented 3 years ago

OK, it makes sense to require --curate_group. Could you describe it in the option? Currently, it's not clear enough (see below). You can provide an example, otherwise, users cannot even know what separator they should use.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB commented 3 years ago

Yeah, that's fair. What about:

"Comma-separated list of values contained in the curate_group metadata column to be included in the analysis. Example input may look like "root,flower,leaf" or "heat stressed,cold stressed,light stressed"."
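For reference, a required option with this help text could be declared roughly like this with Python's argparse. This is only a sketch and not the actual amalgkit parser: `build_parser` and the `prog` string are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a toy parser with a required --curate_group option."""
    parser = argparse.ArgumentParser(prog="amalgkit curate")
    parser.add_argument(
        "--curate_group",
        required=True,  # no default: the user must name the groups
        metavar="STR",
        help="Comma-separated list of values contained in the curate_group "
             "metadata column to be included in the analysis. Example input "
             'may look like "root,flower,leaf" or '
             '"heat stressed,cold stressed,light stressed".',
    )
    return parser


args = build_parser().parse_args(["--curate_group", "root,flower,leaf"])
print(args.curate_group.split(","))
```

Because the option is marked required, omitting it makes argparse exit with a usage error rather than falling back to a guessed default.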

kfuku52 commented 3 years ago

Looks good!

Hego-CCTB commented 3 years ago

Updated in Ver. 0.5.1.2! https://github.com/kfuku52/amalgkit/commit/857974951b3bd74cf5a9ec39dc436475f7912dc9

kfuku52 commented 3 years ago

@Hego-CCTB Please add any other factors which we should take into account in an ideal test dataset. I'll look for it when I have time.

  • small file size of .sra: some bacterial dataset?
  • multiple BioProjects: 2 or 3?
  • 2 species
  • reference transcriptome fasta files are downloadable, maybe from amalgkit repo
  • pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

Hego-CCTB commented 3 years ago

I've been looking for some bacterial sets the other day. With various combinations of

with stresses or specific antibiotics as condition. I was surprised to see that there weren't that many RNA-seq experiments. E. coli produced a couple of hits when running metadata, but the other 2 species didn't have much to offer.

kfuku52 commented 3 years ago

Could you share a summary (maybe a table?) of your survey?

Hego-CCTB commented 3 years ago

Test_data_quick_survey.zip Here is the last amalgkit metadata run I did, along with a summary metadata.tsv. Keywords were: stress, antibiotics, tetracycline. The species were the three I mentioned in the above comment.

I did not anticipate all the different strains, which could be a different problem. In the summary I put in some possible candidate samples, which followed these criteria:

The best I could find was anaerobic/hypoxia stress. Escherichia coli and Mycobacterium tuberculosis each had 2 BioProjects for anaerobic/hypoxia stress, although it might be a stretch to put anaerobic into the same category as hypoxia.

kfuku52 commented 3 years ago

Thank you. E. coli looks promising as expected. I'll search for other species that are suitable for the comparison.

kfuku52 commented 1 year ago

@Hego-CCTB I will take care of it if you don't have time.

Hego-CCTB commented 1 year ago

Yes, please help me out with this issue!

Hego-CCTB commented 8 months ago

I'd like to create a full bacterial dataset for the paper this week, so we may just be able to use a subset for this issue.