minimal test dataset

kfuku52 commented 3 years ago

To make testing easier and quicker, we should find a minimal dataset, and run most, if not all, functionalities from metadata to curate with it. Ideally,

small file size of .sra: some bacterial dataset?
multiple BioProjects: 2 or 3?
2 species
reference transcriptome fasta files are downloadable, maybe from amalgkit repo
pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

Hego-CCTB commented 3 years ago

One thing of note here is that curate currently wants tissues as input. While condition or strain would be functionally equal to tissue, we'll need to adjust curate to look at different columns if prompted to do so.

kfuku52 commented 3 years ago

Good point. How about adding a new column such as curate_group in the metadata table? tissue can be copied as default values but users can manually modify it to include other categories such as treatment, sex, genotype...whatever they want.

kfuku52 commented 3 years ago

... and, of course, amalgkit curate uses curate_group instead of tissue.

Hego-CCTB commented 3 years ago

Adding curate_group should probably be done all the way back in amalgkit metadata. Also definitely needs an explanation in the wiki.

kfuku52 commented 3 years ago

Right, could you do it?

Hego-CCTB commented 3 years ago

On it right now, should be a quick adjustment! I'll probably add the curate_group column during or after the group_tissues_by_config call and just copy the tissue column over.
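A minimal sketch of the adjustment described here, assuming the metadata table is held as a list of row dictionaries with a tissue key. The helper name add_curate_group and the run IDs are hypothetical placeholders, not taken from the amalgkit source, and keeping an already-filled curate_group value is an added assumption so that manual edits survive a re-run:

```python
def add_curate_group(rows):
    """Return copies of the rows with curate_group defaulting to tissue."""
    out = []
    for row in rows:
        new_row = dict(row)
        # Assumption: keep a value the user already set; otherwise copy tissue.
        if not new_row.get("curate_group"):
            new_row["curate_group"] = new_row.get("tissue", "")
        out.append(new_row)
    return out


# Placeholder rows for illustration only.
rows = [
    {"run": "RUN_1", "tissue": "root"},
    {"run": "RUN_2", "tissue": "leaf", "curate_group": "heat stressed"},
]
for row in add_curate_group(rows):
    print(row["run"], row["curate_group"])
```

With this default, a table that only has tissue values behaves as before, while a user who wants to group by treatment, sex, or genotype only has to edit one column.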

Hego-CCTB commented 3 years ago

Amalgkit version 0.5.1.0:

metadata now introduces a curate_group column. By default, this contains the tissue column data
curate now uses the curate_group column instead of tissue
curate --tissues is now obsolete
curate --curate_group takes its place; the input format is unchanged
https://github.com/kfuku52/amalgkit/commit/f5665c6af979aa27ded6317715aca3a40bf55755

kfuku52 commented 3 years ago

Thanks. Please describe the default behavior of --curate_group. I assume all values will be included by default.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB commented 3 years ago

amalgkit curate --curate_group is identical to how amalgkit curate --tissues worked. It was just renamed to avoid confusion. A typical command would look like: amalgkit curate --curate_group "root,flower,leaf" [additional arguments]

Within the R script, this input string is split and read into a vector selected_tissues, which then gets passed to the main algorithm, for example to check_whithin_tissue_correlation.

EDIT: to add to this, --curate_group (like --tissues) does not have a default input, but is required. Theoretically, it would be possible to read selected tissues/conditions/whatever from this column as default, but this can cause all kinds of problems, especially when the metadata sheet contains data from multiple species, or has typos in the column, unused SRR entries, etc.
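To illustrate the splitting step and why an explicit list is safer than reading the column, here is a small Python sketch. It is not the R code that amalgkit actually runs; the function names parse_curate_group and missing_groups are hypothetical, and the column values are made up:

```python
def parse_curate_group(arg):
    """Split a comma-separated --curate_group string into a list of values."""
    values = [v.strip() for v in arg.split(",")]
    # Drop empty entries caused by stray or trailing commas.
    return [v for v in values if v]


def missing_groups(selected, metadata_values):
    """Return the selected values that never occur in the metadata column."""
    present = set(metadata_values)
    return [v for v in selected if v not in present]


selected = parse_curate_group("root,flower,leaf")
# Made-up column content; the last entry contains a typo.
column = ["root", "root", "leaf", "flowr"]
print(selected)
print(missing_groups(selected, column))
```

Because the user states the groups explicitly, a typo in the metadata column shows up as a missing group instead of silently becoming its own category.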

kfuku52 commented 3 years ago

OK, it makes sense to require --curate_group. Could you describe it in the option? Currently, it's not clear enough (see below). You can provide an example, otherwise, users cannot even know what separator they should use.

List of curate_group values of the curate_group metadata column to be included

Hego-CCTB commented 3 years ago

Yeah, that's fair. What about:

"Comma-separated list of values contained in the curate_group metadata column to be included in the analysis. Example input may look like "root,flower,leaf" or "heat stressed,cold stressed,light stressed"."
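For reference, this is roughly how such a required option with that help text could be declared using Python's standard argparse module. It is only a sketch and not the actual amalgkit parser definition; the prog string and metavar are assumptions:

```python
import argparse

# Hypothetical parser; amalgkit's real argument setup may differ.
parser = argparse.ArgumentParser(prog="amalgkit curate")
parser.add_argument(
    "--curate_group",
    required=True,  # no default: the user must name the groups explicitly
    metavar="STR",
    help=(
        "Comma-separated list of values contained in the curate_group "
        "metadata column to be included in the analysis. Example input may "
        'look like "root,flower,leaf" or '
        '"heat stressed,cold stressed,light stressed".'
    ),
)

args = parser.parse_args(["--curate_group", "root,flower,leaf"])
print(args.curate_group)
```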

kfuku52 commented 3 years ago

Looks good!

Hego-CCTB commented 3 years ago

Updated in Ver. 0.5.1.2! https://github.com/kfuku52/amalgkit/commit/857974951b3bd74cf5a9ec39dc436475f7912dc9

kfuku52 commented 3 years ago

@Hego-CCTB Please add any other factors which we should take into account in an ideal test dataset. I'll look for it when I have time.

small file size of .sra: some bacterial dataset?
multiple BioProjects: 2 or 3?
2 species
reference transcriptome fasta files are downloadable, maybe from amalgkit repo
pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo

Hego-CCTB commented 3 years ago

I was looking for some bacterial sets the other day, with various combinations of

Escherichia coli
Bacillus subtilis
Mycobacterium tuberculosis

with stresses or specific antibiotics as the condition. I was surprised to see that there weren't that many RNA-seq experiments. E. coli produced a couple of hits when running metadata, but the other 2 species didn't have much to offer.

kfuku52 commented 3 years ago

Could you share a summary (maybe a table?) of your survey?

Hego-CCTB commented 3 years ago

Test_data_quick_survey.zip Here is the last amalgkit metadata run I did, along with a summary metadata.tsv. Keywords were: stress, antibiotics, tetracycline. The species were the three I mentioned in the above comment.

I did not anticipate all the different strains, which could be a different problem. In the summary I put in some possible candidate samples, which followed these criteria:

same (or at least similar) treatment in at least 2 species
minimum 2 bioprojects for each species in their respective treatments
must have untreated control sample as well
I tried to have them all be 'wildtype' too, but there would be no candidates left at all

The best I could find was anaerobic/hypoxia stress. Escherichia coli and Mycobacterium tuberculosis each had 2 BioProjects for anaerobic/hypoxia stress, although it might be a stretch to put anaerobic into the same category as hypoxia.
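The selection criteria above can be sketched as a small filter. This is an illustration only, assuming each sample is a dictionary with species, treatment, and bioproject keys and that untreated samples are labeled "control"; the function name candidate_treatments and all sample values are placeholders, not data from the survey:

```python
from collections import defaultdict


def candidate_treatments(samples, control="control", min_projects=2, min_species=2):
    """Return treatments that satisfy the survey criteria."""
    # (species, treatment) -> set of BioProject IDs
    projects = defaultdict(set)
    for s in samples:
        projects[(s["species"], s["treatment"])].add(s["bioproject"])

    # Species that have at least one untreated control sample.
    species_with_control = {sp for (sp, tr) in projects if tr == control}

    # treatment -> species with enough BioProjects and a control sample
    supported = defaultdict(set)
    for (sp, tr), ids in projects.items():
        if tr != control and len(ids) >= min_projects and sp in species_with_control:
            supported[tr].add(sp)

    return sorted(tr for tr, sps in supported.items() if len(sps) >= min_species)


# Placeholder samples, not real survey data.
samples = [
    {"species": "species_A", "treatment": "hypoxia", "bioproject": "P1"},
    {"species": "species_A", "treatment": "hypoxia", "bioproject": "P2"},
    {"species": "species_A", "treatment": "control", "bioproject": "P1"},
    {"species": "species_B", "treatment": "hypoxia", "bioproject": "P3"},
    {"species": "species_B", "treatment": "hypoxia", "bioproject": "P4"},
    {"species": "species_B", "treatment": "control", "bioproject": "P3"},
    {"species": "species_B", "treatment": "heat", "bioproject": "P4"},
]
print(candidate_treatments(samples))
```

Whether anaerobic and hypoxia count as the same treatment would have to be decided before filtering, for example by mapping both labels to one value in the treatment column.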

kfuku52 commented 3 years ago

Thank you. E. coli looks promising as expected. I'll search for other species that are suitable for the comparison.

kfuku52 commented 1 year ago

@Hego-CCTB I will take care of it if you don't have time.

Hego-CCTB commented 1 year ago

Yes, please help me out with this issue!

Hego-CCTB commented 8 months ago

I'd like to create a full bacterial dataset for the paper this week, so we may just be able to use a subset for this issue.

kfuku52 / amalgkit

minimal test dataset #41