Yes! I want to start processing my own data soon anyway, so I can do this in tandem. Originally I wanted to do this as part of `sanity`, but it may be better to have it as its own functionality.
One issue that immediately comes to my mind is the case where someone wants to mix SRA data and their own data, simply because of the metadata column differences. It will probably be necessary to expand `merge` so that it can integrate an "SRA metadata.tsv" and an "in-house metadata.tsv".
Yes, we should design it carefully. My idea on the workflow is:
1. Run `amalgkit metadata` to collect NCBI's data.
2. Place own fastq files in `/own_fastq_dir/*.fq` or `/own_fastq_dir/GENUS_SPECIES/*.fq`.
3. Run `amalgkit sanity` or a tailored subcommand that takes `--metadata` and `--own_fastq_dir` as options. This operation appends own fastq info to the existing metadata table and outputs the combined table in `/metadata/metadata`. Most likely, what we can do here is to add very little info, such as `scientific_name`, `is_sampled` (Yes), `is_qualified` (Yes), `exclusion` (no), `run` (any IDs should be fine, but probably we can extract an ID from the file name), `lib_layout` (guess it from file names: `_1` & `_2` or not), `spot_length`, `total_spots`, `total_bases`, `size`, and a new column that stores the absolute PATH to the fastq file. `seqkit stats` would be useful if python-based libraries are too slow to calculate seq statistics. (A rough sketch of this step follows after the list.)
4. Manually fill the metadata table. The minimum addition should be `scientific_name` (if not added yet) and `curate_group`. Other info can be filled as well but won't change amalgkit's behavior.
5. Run `amalgkit getfastq`. `getfastq` shouldn't make any changes to the original fastq files but use them as `fastp` input, and generate `.amalgkit.fastq` in `/getfastq/RUN`. If `fastp` is disabled, the required operation is only to copy the file.
6. Run `amalgkit quant` and downstream subcommands as usual.
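Just to make step 3 concrete, here is a very rough sketch of how a per-file metadata row could be derived; the column handling and the path column name are assumptions, not existing amalgkit code:

```python
import os

def own_fastq_to_metadata_row(fastq_path):
    """Build a minimal metadata row for one in-house fastq file.
    Sequence statistics (spot_length, total_spots, total_bases, size)
    would be filled separately, e.g. from `seqkit stats` output."""
    stem = os.path.basename(fastq_path).split(".")[0]  # naive: drops .fq/.fastq(.gz)
    is_read_of_pair = stem.endswith(("_1", "_2"))      # crude lib_layout guess
    return {
        "run": stem[:-2] if is_read_of_pair else stem,  # ID extracted from file name
        "scientific_name": "",                          # filled manually later
        "curate_group": "",                             # filled manually later
        "is_sampled": "Yes",
        "is_qualified": "Yes",
        "exclusion": "no",
        "lib_layout": "paired" if is_read_of_pair else "single",
        "own_fastq_path": os.path.abspath(fastq_path),  # placeholder column name
    }
```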
Yeah, I agree on most points.
> Yes, we should design it carefully. My idea on the workflow is:
> 1. Run `amalgkit metadata` to collect NCBI's data.

By this you probably mean the case I've mentioned above, where someone would want to mix in-house and SRA data, right? So you would combine the SRA metadata and in-house metadata right in this subfunction rather than later during `merge`? Yeah, that is likely more sensible, esp. for `quant` input later.
> 3. Run `amalgkit sanity` or a tailored subcommand that takes `--metadata` and `--own_fastq_dir` as options. This operation appends own fastq info to the existing metadata table and outputs the combined table in `/metadata/metadata`. Most likely, what we can do here is to add very little info, such as `scientific_name`, `is_sampled` (Yes), `is_qualified` (Yes), `exclusion` (no), `run` (any IDs should be fine, but probably we can extract an ID from the file name), `lib_layout` (guess it from file names: `_1` & `_2` or not), `spot_length`, `total_spots`, `total_bases`, `size`, and a new column that stores the absolute PATH to the fastq file. `seqkit stats` would be useful if python-based libraries are too slow to calculate seq statistics.
> 4. Manually fill the metadata table. The minimum addition should be `scientific_name` (if not added yet) and `curate_group`. Other info can be filled as well but won't change amalgkit's behavior.
An alternative solution to this would be how Trinity handles `--sample_list`. If you have multiple samples for assembly, you can create a simple `.txt` file in this format:

```
cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq
```

This would have two upsides: 1) the user can fill in the data before starting amalgkit and can have this run without supervision until the files are ready for `quant` and beyond, and 2) this of course simplifies the coding too, if we just have to parse in a file.
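Parsing such a file would indeed be simple; a sketch of what that could look like (the target column names are just placeholders, not an amalgkit convention):

```python
import csv
import os

def read_sample_list(path):
    """Read a Trinity-style sample list: tab-separated columns
    condition, replicate name, left fastq [, right fastq]."""
    samples = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 3 or row[0].startswith("#"):
                continue  # skip blank/short lines and comments
            condition, replicate, left, *right = row
            samples.append({
                "curate_group": condition,
                "run": replicate,
                "fastq_1": os.path.abspath(left),
                "fastq_2": os.path.abspath(right[0]) if right else "",
                "lib_layout": "paired" if right else "single",
            })
    return samples
```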
> 5. Run `amalgkit getfastq`. `getfastq` shouldn't make any changes to the original fastq files but use them as `fastp` input, and generate `.amalgkit.fastq` in `/getfastq/RUN`. If `fastp` is disabled, the required operation is only to copy the file.
Makes sense, but rather than adjusting and running `getfastq`, I'd just borrow the `run_fastp` function from there and run it within `sanity`/the new subfunction.
> Makes sense, but rather than adjusting and running `getfastq`, I'd just borrow the `run_fastp` function from there and run it within `sanity`/the new subfunction.
I think we should stick to `getfastq`. Is there any advantage to running it separately?
> This would have two upsides: 1) the user can fill in the data before starting amalgkit and can have this run without supervision until the files are ready for `quant` and beyond, and 2) this of course simplifies the coding too, if we just have to parse in a file.

`metadata.tsv` should be manually inspected anyway, no matter whether you have your own fastq or not. To me, it's difficult to imagine that requiring another format of input metadata saves the user's time and our coding effort.
Ah, yeah I assumed that the SRA data would already be downloaded and processed and it would be the in-house data that needed to "catch up".
It makes much more sense if the SRA data still needs to go through getfastq
download. In that case I agree with the workflow above.
It wouldn't be a problem even if SRA data were already processed. You can set getfastq/quant --redo no
and the SRA download won't start again.
For me it was more about just skipping `getfastq` altogether and not needing to manually start a different command before `quant`. Or would you have `getfastq` automatically be called by amalgkit?
> getfastq automatically be called by amalgkit

I'm not sure what this means, but `getfastq` should be invoked manually, just like any other subcommand should be.
> I'm not sure what this means, but `getfastq` should be invoked manually, just like any other subcommand should be.
Gotcha! I'll stick to the above workflow!
> 3. `seqkit stats` would be useful if python-based libraries are too slow to calculate seq statistics.

I have decided to go with this approach. This turns out to be really fast (esp. on multiple threads) and I don't have to worry about gzipped files.
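For reference, this is roughly what a `seqkit stats` call looks like from Python; the flags are standard seqkit options, but exactly which columns end up in the metadata table is an assumption here:

```python
import csv
import subprocess

def seqkit_stats(fastq_paths, threads=4):
    """Run `seqkit stats` on fastq(.gz) files and return one dict per file.
    With --tabular, seqkit prints a tab-separated table with columns such as
    file, format, type, num_seqs, sum_len, min_len, avg_len and max_len."""
    cmd = ["seqkit", "stats", "--tabular", "--threads", str(threads), *fastq_paths]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return list(csv.DictReader(result.stdout.splitlines(), delimiter="\t"))
```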
Ok, I am at a point where I can infer/calculate sequence statistics and put them into a metadata.tsv. The metadata of the private fastq data can also be merged with public fastq metadata if the `--metadata` parameter is provided.
This is what the private fastq metadata looks like:
tmp_metadata.tsv.zip
And this is what it looks like when merged with public fastq metadata:
metadata.tsv.zip
Note: For testing purposes I included only one of the reads for the PAUL23 samples to see if it gets recognized as `single` lib_layout. This process assumes that forward/reverse reads are distinguished by `_1` and `_2` suffixes, like sample1_1.fastq.gz and sample1_2.fastq.gz. (.fq.gz, .fastq, and .fq should also be supported file formats, as well as any number of underscores in the sample name, like sample_1_1.fastq.gz and sample_1_2.fastq.gz.)
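The filename convention described above could be captured with a regex along these lines (an illustration of the rule, not the actual implementation):

```python
import re

# sample_1_1.fastq.gz -> prefix "sample_1", mate "1"; PAUL23.fq.gz -> single-end
READ_PAIR = re.compile(r"^(?P<prefix>.+)_(?P<mate>[12])(?P<ext>\.f(?:ast)?q(?:\.gz)?)$")

def classify_fastq(filename):
    """Return (sample_name, lib_layout_guess) for a fastq file name."""
    match = READ_PAIR.match(filename)
    if match:
        return match.group("prefix"), "paired"
    return re.sub(r"\.f(?:ast)?q(?:\.gz)?$", "", filename), "single"
```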
Todo list:
- If `getfastq` output for a particular run is detected, mark it in a new column so it can be skipped by `getfastq` later
- Adjust `getfastq` to download/skip/process private and/or public data if necessary

To be consistent with SRA's metadata, `is_sampled` and `is_qualified` should be `Yes`. But I feel the SRA side should be changed to `yes` so it's consistent with other fields such as `exclusion`.
amalgkit integrate
- `amalgkit integrate` creates metadata based on fastq files found in a user-defined directory.
- If `--metadata PATH_TO_METADATA` is supplied, `amalgkit integrate` will integrate the in-house fastq metadata with the supplied metadata sheet. It will also make a note if `getfastq` output of public files is already present.
- Run `amalgkit getfastq` to process private fastq files. This requires the `amalgkit integrate` metadata.tsv output.

`amalgkit integrate`:

```
amalgkit integrate -w PATH/TO/WORKING/DIRECTORY \
  --fastq_dir PATH/TO/FASTQ/DIRECTORY \
  --threads 8 \
  --metadata PATH/TO/METADATA.TSV
```
amalgkit getfastq
- Simply add `--private_fastq yes` to the `getfastq` command you would usually use and supply the appropriate `amalgkit integrate` metadata sheet to `--metadata`.
- If existing `getfastq` output was already noted by `amalgkit integrate`, those runs will be skipped entirely. This can be changed by setting `--redo yes`.
- Private fastq files are handled by `getfastq`: they'll be processed by `fastp` if `--fastp yes` is set and moved to `./getfastq/RUN-ID/`. Otherwise, they'll just be moved. Original private fastq files will not be deleted by `--remove_intermediate` (public files would normally be).

Currently running tests including all combinations of public/private single/paired files, as well as making sure the original `getfastq` usage with just public files still runs normally. I'll push the update after testing.
rolled out the update: https://github.com/kfuku52/amalgkit/commit/b9eec3fc0c1947542be4e56ad9ab3627d5892458
Thank you!
> Simply add --private_fastq yes to the getfastq command you would usually use and supply the appropriate amalgkit integrate metadata sheet to --metadata

I'm not sure if `--private_fastq` is necessary, because `getfastq` should process all entries in the metadata table. So isn't it enough to pass `integrate`-derived metadata to `getfastq`?
Hmm. I suppose it's possible to look for the presence of `integrate`-specific columns and set a flag in the code, rather than having the user set it, yes.
Yeah, I can obsolete the `--private_fastq` argument!
Obsoleted `--private_fastq`. Instead, `integrate`-derived metadata is detected automatically by the presence of certain columns: https://github.com/kfuku52/amalgkit/commit/ea008e53c9ff257154212dda5b57ac0cf6673986
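For what it's worth, the check could be as simple as testing for a column that only `integrate` writes; a sketch with a made-up column name, not the one amalgkit actually uses:

```python
import pandas as pd

def is_integrate_metadata(metadata_path):
    """True if the metadata table looks like `amalgkit integrate` output.
    'own_fastq_path' stands in for whichever column integrate actually adds."""
    columns = pd.read_csv(metadata_path, sep="\t", nrows=0).columns
    return "own_fastq_path" in columns
```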
There is a work-around discussed previously, but amalgkit should officially support it. @Hego-CCTB This feature will be necessary for your project anyway, so could you take care of it?