Open kfuku52 opened 3 years ago
One thing of note here is that `curate` currently wants `tissues` as input. While `condition` or `strain` would be functionally equal to `tissue`, we'll need to adjust `curate` to look at different columns if prompted to do so.
Good point. How about adding a new column such as `curate_group` in the metadata table? `tissue` can be copied as the default values, but users can manually modify it to include other categories such as treatment, sex, genotype... whatever they want.
... and, of course, `amalgkit curate` uses `curate_group` instead of `tissue`.
Adding `curate_group` should probably be done all the way back in `amalgkit metadata`.
Also definitely needs an explanation in the wiki.
Right, could you do it?
On it right now, should be a quick adjustment! I'll probably add the `curate_group` column during or after the `group_tissues_by_config` call and just copy the `tissue` column over.
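The plan above (copy `tissue` into a new `curate_group` column, while leaving room for manual edits) could be sketched roughly as follows. This is not the amalgkit implementation: the helper name `add_curate_group`, the `run` column, and the two-row table are assumptions made up for illustration.

```python
import csv
import io


def add_curate_group(rows):
    """Return copies of the rows with a curate_group column defaulting to tissue."""
    out = []
    for row in rows:
        new_row = dict(row)
        # Keep a value the user already edited by hand; otherwise copy tissue.
        if not new_row.get("curate_group"):
            new_row["curate_group"] = new_row.get("tissue", "")
        out.append(new_row)
    return out


# Hypothetical two-row metadata table in TSV form (not real SRA runs).
tsv = "run\ttissue\nSRR0000001\troot\nSRR0000002\tleaf\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
rows = add_curate_group(rows)
```

Copying only when `curate_group` is empty means a re-run would not overwrite categories such as treatment or genotype that a user entered manually.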
Amalgkit version 0.5.1.0:
- New `curate_group` column in the metadata table. By default, this contains the `tissue` column data
- `amalgkit curate` uses the `curate_group` column instead of `tissue`
- `curate --tissues` is now obsolete
- `curate --curate_group` takes its place, input is unchanged
https://github.com/kfuku52/amalgkit/commit/f5665c6af979aa27ded6317715aca3a40bf55755

Thanks. Please describe the default behavior of `--curate_group`. I assume all values will be included by default.
List of curate_group values of the curate_group metadata column to be included
`amalgkit curate --curate_group` is identical to how `amalgkit curate --tissues` worked. It was just renamed to avoid confusion.
A typical command would look like:
amalgkit curate --curate_group "root,flower,leaf" [additional arguments]
Within the R script, this input string is split and read into a vector `selected_tissues`, which is then passed to the main algorithm, for example to `check_whithin_tissue_correlation`.
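For readers who want to see the splitting step spelled out, here is a minimal Python sketch of the same idea. The actual logic lives in the R script and is not shown in this thread, so the function name `parse_curate_group` and the whitespace handling are assumptions.

```python
def parse_curate_group(arg):
    """Split a comma-separated --curate_group string into a list of group names."""
    # Strip surrounding whitespace and drop empty items such as a trailing comma.
    return [item.strip() for item in arg.split(",") if item.strip()]


selected = parse_curate_group("root,flower,leaf")
```

Because only the comma is treated as a separator, multi-word values such as `heat stressed` survive as single group names.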
EDIT: To add to this, `--curate_group` (like `--tissues`) does not have a default input; it is required. Theoretically, it would be possible to read the selected tissues/conditions/whatever from this column by default, but this can cause all kinds of problems, especially when the metadata sheet contains data from multiple species, or has typos in the column, unused SRR entries, etc.
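A small sketch of why deriving the default from the column is risky, and of what an explicit list allows instead. The rows, the `scientific_name` column, and both helper functions are hypothetical and only illustrate the argument; they are not taken from amalgkit.

```python
# Hypothetical metadata rows; the values are invented for illustration only.
rows = [
    {"scientific_name": "Species A", "curate_group": "root"},
    {"scientific_name": "Species A", "curate_group": "leaf"},
    {"scientific_name": "Species A", "curate_group": "laef"},    # typo
    {"scientific_name": "Species B", "curate_group": "flower"},  # other species
]


def groups_from_column(rows):
    """Collect every distinct curate_group value, in sorted order."""
    return sorted({row["curate_group"] for row in rows})


def unknown_groups(requested, rows):
    """Return requested groups that never occur in the curate_group column."""
    present = {row["curate_group"] for row in rows}
    return [g for g in requested if g not in present]


# A column-derived default silently picks up the typo and the other species.
derived = groups_from_column(rows)
# An explicit list can be checked against the column before the analysis runs.
missing = unknown_groups(["root", "leaf", "stem"], rows)
```

With an explicit, required list, a mismatch like `stem` can be reported up front, whereas a column-derived default would treat `laef` as a legitimate group.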
OK, it makes sense to require --curate_group. Could you describe it in the option? Currently, it's not clear enough (see below). You can provide an example, otherwise, users cannot even know what separator they should use.
List of curate_group values of the curate_group metadata column to be included
Yeah, that's fair. What about:
"comma separated list of values contained in the curate_group metadata column to be included in the analysis. Example input may look like "root,flower,leaf" or "heat stressed,cold stressed,light stressed".
Looks good!
Updated in Ver. 0.5.1.2! https://github.com/kfuku52/amalgkit/commit/857974951b3bd74cf5a9ec39dc436475f7912dc9
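As a rough illustration of a required option carrying the help text agreed on above, here is a standalone `argparse` sketch. It is not copied from the amalgkit source; the program name `curate-sketch` and the surrounding parser setup are assumptions.

```python
import argparse

# Standalone parser for illustration; not the real amalgkit CLI definition.
parser = argparse.ArgumentParser(prog="curate-sketch")
parser.add_argument(
    "--curate_group",
    required=True,  # no default: the user must name the groups explicitly
    help=(
        "Comma separated list of values contained in the curate_group "
        "metadata column to be included in the analysis. Example input may "
        'look like "root,flower,leaf" or '
        '"heat stressed,cold stressed,light stressed".'
    ),
)

args = parser.parse_args(["--curate_group", "heat stressed,cold stressed"])
groups = args.curate_group.split(",")
```

On a shell, values containing spaces need to be quoted as one argument, which is why the example input in the help text is shown in quotes.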
@Hego-CCTB Please add any other factors which we should take into account in an ideal test dataset. I'll look for it when I have time.
- small file size of .sra: some bacterial dataset?
- multiple BioProjects: 2 or 3?
- 2 species
- reference transcriptome fasta files are downloadable, maybe from amalgkit repo
- pre-calculated orthofinder outputs are downloadable, maybe from amalgkit repo
I was looking for some bacterial sets the other day, using various combinations of species with stresses or specific antibiotics as the condition. I was surprised to see that there weren't that many RNA-seq experiments. E. coli produced a couple of hits when running `metadata`, but the other 2 species didn't have much to offer.
Could you share a summary (maybe a table?) of your survey?
Test_data_quick_survey.zip
Here is the last `amalgkit metadata` run I did, along with a summary metadata.tsv. Keywords were: `stress`, `antibiotics`, `tetracycline`. The species were the three I mentioned in the above comment.
I did not anticipate all the different strains, which could be a different problem. In the summary I put in some possible candidate samples, which followed these criteria:
The best I could find was anaerobic/hypoxia stress. Escherichia coli and Mycobacterium tuberculosis each had 2 BioProjects for anaerobic/hypoxia stress, although it might be a stretch to put anaerobic into the same category as hypoxia.
Thank you. E. coli looks promising as expected. I'll search for other species that are suitable for the comparison.
@Hego-CCTB I will take care of it if you don't have time.
Yes, please help me out with this issue!
I'd like to create a full bacterial dataset for the paper this week, so we may just be able to use a subset for this issue.
To make testing easier and quicker, we should find a minimal dataset and run most, if not all, functionalities from `metadata` to `curate` with it. Ideally,