Closed kfuku52 closed 3 years ago
This column seems to be used in cstmm too, so it should be generated before that.
I just double-checked the data I've been downloading recently, and it's missing there as well. I remember num_read_fastp got added to the metadata.tsv output that is produced for each SRR ID, not to the full table produced by amalgkit metadata. It's missing there as well, though.
In curate, num_read_fastp is used for mapping rate calculation, but it should not be part of cstmm.
OK, I thought cross_species.R was part of cstmm, but actually it's not.
We need mapping rates, num_read_fastp, and other related info to be reported as part of the metadata table. We shouldn't overwrite the input metadata, so a new tsv should be created. The new table should also store other per-SRR stats that can only be calculated after processing fastq, e.g., TMM normalization factors, an updated exclusion column (e.g., low mapping rate), and so on. Would you be able to work on this?
Having briefly looked through the code, num_read_fastp should be saved somewhere in the variable/dataframe seq_summary, but it isn't actually properly dumped into the new metadata.tsv. I will investigate this.
cross_species.R
is an obsolete debugging file, I will remove this file in the next update.
Since the user has the option to not use fastp during quant, we should account for that possibility as well. And the mapping rate is calculated with the original metadata stats.
Since the user has the option to not use fastp during quant, we should account for that possibility as well.
This is right. You can pick num_read_XXX depending on its presence.
And the mapping rate is calculated with the original metadata stats.
This is not correct. The read numbers should be obtained from the actual fastq, because there is no guarantee that all the reads have been extracted.
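As a minimal sketch of that idea (not amalgkit's actual code), the read count can be taken directly from the dumped fastq rather than from the SRA metadata:

```python
import gzip

def count_fastq_reads(fastq_gz):
    # A fastq record is 4 lines; counting lines in the actual dumped file
    # avoids trusting read numbers reported in the SRA metadata.
    with gzip.open(fastq_gz, 'rt') as handle:
        return sum(1 for _ in handle) // 4
```

This is linear in file size, but only needs to run once per SRR right after extraction.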
Turns out it was a very straightforward solution. I can add any other fastp and fastq statistics as well, like dumped, rejected, and written reads. Should I just add them all?
Also, I would change curate to take the fastq-dump written reads for mapping rate calculation if num_read_fastp isn't available, correct?
Another thing: my current solution is to use the already implemented flag --save_metadata and output the statistics within a single-row metadata.tsv. But since this table contains statistics necessary for curate even if fastp was turned off, I'd remove the argument --save_metadata entirely and just always output the single-row metadata.tsv. We can keep the option, of course, and leave it on by default, but I can't think of any reason why someone would not need this output.
Turns out it was a very straightforward solution. I can add any other fastp and fastq statistics as well, like dumped, rejected, and written reads. Should I just add them all?
Yes, let's keep as much info as possible.
Also, I would change curate to take the fastq-dump written reads for mapping rate calculation if num_read_fastp isn't available, correct?
Yes, you are right.
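That fallback could look like this sketch (the helper name is made up; the column names are the ones discussed in this thread):

```python
import pandas as pd

def get_num_read(row):
    # Prefer the fastp-filtered count; fall back to the fastq-dump
    # written count when fastp was skipped during quant.
    for col in ('num_read_fastp', 'num_read_fastq_written'):
        value = row.get(col)
        if value is not None and not pd.isna(value):
            return int(value)
    raise ValueError('no usable read count in metadata row')
```

For example, a row with num_read_fastp missing (NaN) would fall through to num_read_fastq_written.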
But since this table contains statistics necessary for curate even if fastp was turned off, I'd remove the argument --save_metadata entirely and just always output the single-row metadata.tsv. We can keep the option, of course, and leave it on by default, but I can't think of any reason why someone would not need this output.
Please obsolete the option, as it becomes a mandatory feature.
I need one last clarification: would num_read_unfiltered in curate be equivalent to the number of fastq-dumped reads in getfastq, or something else entirely?
Should be the same. I annotated this as num_read_unfiltered because there was another "filtering" step in the NC paper (filtering out reads mapped to mitochondrial genomes), which is not present in amalgkit because of its limited effectiveness. You can rename the column.
getfastq
getfastq output now looks like this:
The metadata.tsv is a single line (plus header) for that particular ID. New statistics added to the metadata.tsv:
num_read_fastq_dumped (formerly num_read_unfiltered)
num_read_fastq_written
num_read_fastq_rejected
num_read_fastp
num_read_fastp_input
curate
num_read_unfiltered is now num_read_fastq_dumped
fraction_lost_fastp, or fraction_lost_fastq (if num_read_fastp is unavailable)
num_read_fastq_written is used if num_read_fastp is unavailable
If these changes are OK, I'll push the update.
An idea: maybe add to merge, or to another subfunctionality (maybe amalgkit gather), the ability to collect and merge all the metadata.tsv's from the getfastq output into a single file.
I'm wondering where we should put the single-row metadata tsv files. These files will be accessed/updated by different subcommands, so wouldn't it make more sense to put them in a new, dedicated sub-directory under outdir/metadata/ rather than under outdir/getfastq/?
I'm thinking something like this:
outdir
|
+--metadata
|
+--metadata
+--pivot_tables
+--temp
+--updated_metadata (<-THIS!)
|
+--metadata_SRR0000000.tsv
+--metadata_SRR0000001.tsv
+--metadata_SRR0000002.tsv
Yeah, that makes a lot of sense. And then the merge of all these .tsv's is just a cat away to be used for curate too.
curate should change the metadata, because there may be some excluded SRRs, which should be noted in the exclusion column. Merging all single-row tsvs should be done in the very final step after curate, maybe in amalgkit summary.
In that case, I'll need to also adjust curate to take the updated_metadata folder as additional input to get the num_read_fastp info for the mapping rate calculation and the calculation of the lost read %.
Yes, multiple subcommands including curate should share and use a function to load it into a DataFrame. I guess it wouldn't take time even when 1000s of rows need to be concatenated.
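A shared loader could be as simple as this sketch (file layout and naming as in the tree above; the function name is hypothetical):

```python
import glob
import os
import pandas as pd

def load_updated_metadata(metadata_dir):
    # Concatenate all single-row metadata_*.tsv files (one per SRR)
    # into one DataFrame; all files are assumed to share the same header.
    paths = sorted(glob.glob(os.path.join(metadata_dir, 'metadata_*.tsv')))
    if not paths:
        raise FileNotFoundError('no metadata tsv files in ' + metadata_dir)
    return pd.concat((pd.read_csv(p, sep='\t') for p in paths), ignore_index=True)
```

pandas concatenates a few thousand single-row frames quickly, so this should be cheap enough to call from every subcommand that needs the table.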
amalgkit getfastq
amalgkit getfastq will now output metadata.tsv files with updated fastp/fastq statistics into a folder structure like this:
work_dir
|
+--metadata
|
+--metadata
+--pivot_tables
+--temp
+--updated_metadata (<-THIS!)
|
+--metadata_SRR0000000.tsv
+--metadata_SRR0000001.tsv
+--metadata_SRR0000002.tsv
amalgkit curate
amalgkit curate now requires --updated_metadata_dir (kind of a long name, maybe truncating this in the future) as input.
amalgkit curate reads out fastp and fastq statistics from all runs where exclusion==no in the output of amalgkit metadata, and calculates:
mapping_rate
fraction_lost_fastp OR fraction_lost_fastq, depending on availability
This join would be confusing. --updated_metadata_dir should receive a valid path, not a truncated path that is valid only when joined with args.work_dir. Same applies to https://github.com/kfuku52/amalgkit/issues/24.
updated_metadata_dir = os.path.join(args.work_dir, args.updated_metadata_dir)
Also, please change it to required=False. You can set an arbitrary default value such as "infer" and detect it later to replace it with os.path.join(args.work_dir, 'metadata/updated_metadata/'). Please see --iqtree_treefile in csubst for an example.
pcu.add_argument('--updated_metadata_dir', metavar='PATH', default='./metadata/updated_metadata/', type=str, required=True, action='store',
help='default=%(default)s: Directory containing metadata.tsv files as put out by amalgkit getfastq.')
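The requested change could look like this sketch (required=False plus an "inferred" sentinel resolved after parsing; the parser wiring is illustrative, not amalgkit's actual code):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--work_dir', metavar='PATH', default='.', type=str)
parser.add_argument('--updated_metadata_dir', metavar='PATH', default='inferred',
                    type=str, required=False, action='store',
                    help='default=%(default)s: directory containing the single-row '
                         'metadata.tsv files produced by amalgkit getfastq.')

def resolve_updated_metadata_dir(args):
    # Detect the sentinel and substitute the conventional location under
    # work_dir; any explicit value is taken as a full path on its own.
    if args.updated_metadata_dir == 'inferred':
        return os.path.join(args.work_dir, 'metadata', 'updated_metadata')
    return os.path.realpath(args.updated_metadata_dir)
```

This way the flag is optional, but an explicitly supplied path is never re-joined with work_dir.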
Just FYI, some quick testing makes it seem that os.path.join(args.work_dir, args.updated_metadata) is smart enough to resolve the correct path:
Here, I give absolute paths to all input files:
amalgkit curate --work_dir /Users/s229181/MSN --metadata /Users/s229181/MSN/metadata/Metadata_all_2.tsv --infile /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv --eff_len_file /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv --tissues 'root,flower,leaf' --norm fpkm --cleanup 1 --updated_metadata_dir /Users/s229181/MSN/metadata/updated_metadata/
the R script reports (omitting some other input parameters):
working_directory = /Users/s229181/MSN
metadata: /Users/s229181/MSN/metadata/Metadata_all_2.tsv
infile: /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv
gene length file: /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv
updated metadata directory: /Users/s229181/MSN/metadata/updated_metadata
The same code, but I only give the truncated paths, starting from the working directory:
amalgkit curate --work_dir /Users/s229181/MSN/ --metadata metadata/Metadata_all_2.tsv --infile counts/Citrullus_lanatus_est_counts.tsv --eff_len_file eff_length/Citrullus_lanatus_eff_length.tsv --tissues 'root,flower,leaf' --norm fpkm --cleanup 1 --updated_metadata_dir metadata/updated_metadata/
Rscript reports:
working directory: /Users/s229181/MSN
metadata: /Users/s229181/MSN/metadata/Metadata_all_2.tsv
infile: /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv
gene length file: /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv
updated metadata directory: /Users/s229181/MSN/metadata/updated_metadata
I do concede that it might be confusing, but whatever the user decides should still be working as intended.
The coding should be done in a way that prevents possible errors in the future. Please imagine cases like this, in which os.path.join() failed to handle nested directories with the same name.
https://stackoverflow.com/questions/44022916/combine-two-paths-with-common-folder
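A quick illustration of the pitfall (paths reused from the example above):

```python
import os

# os.path.join does not deduplicate overlapping components:
nested = os.path.join('/Users/s229181/MSN', 'MSN/metadata/updated_metadata')
print(nested)    # /Users/s229181/MSN/MSN/metadata/updated_metadata

# and an absolute second argument silently discards the first one:
replaced = os.path.join('/Users/s229181/MSN', '/metadata/updated_metadata')
print(replaced)  # /metadata/updated_metadata
```

So joining work_dir with a user-supplied path only happens to work when the user anticipates the join; treating the flag as a full path avoids both failure modes.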
Ah, yes, that makes sense. I did not consider this. Will adjust the code accordingly!
Changed the pathing default to "inferred". If inferred, the path is set to workdir/metadata/updated_metadata/; otherwise, the full path the user provided is used.
I pushed the update to master, so the updated_metadata input from getfastq is now required. For testing, I split my large metadata table from metadata into single-line metadata files with:
awk -F $'\t' '{print>$1;close($1)}' metadata.tsv
rename 's/$/\.tsv/' *
then you have to add the header of the original metadata.tsv to all the small ones with:
sed -i '1i \copy_and_paste_header_here' ./*.tsv
# better way:
header="`cat run.tsv`"
gsed -i '1i '"${header}" ./*.tsv
rm run.tsv
These should be in the /metadata/updated_metadata/ folder and should be in the correct format.
Note: since the header contains tabs, it might be necessary to replace those with a different separator in a different file, add them to the .tsv's, and then reintroduce the tab separator.
EDIT: FIXED THE COMMANDS ABOVE
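For what it's worth, the same split can be sketched in Python without the header juggling (the function name is made up, and the run column name may differ from 'run' in your sheet):

```python
import os
import pandas as pd

def split_metadata_per_run(metadata_tsv, out_dir, run_col='run'):
    # Write each row of the big metadata table to its own single-row tsv
    # (header included), named <run_id>.tsv.
    df = pd.read_csv(metadata_tsv, sep='\t')
    os.makedirs(out_dir, exist_ok=True)
    for _, row in df.iterrows():
        out_path = os.path.join(out_dir, str(row[run_col]) + '.tsv')
        row.to_frame().T.to_csv(out_path, sep='\t', index=False)
```

Because pandas writes the header with each file, no sed/rename round-trip is needed, and it behaves the same on macOS and Linux.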
Thank you. Please close the issue if everything looks fine.
Fixed the rest of the input files as well. All curate inputs now require the full path (instead of work_dir + truncated path).
The above awk command wasn't valid on my Mac. It worked when {...} was single-quoted, but the result doesn't look correct: files were generated per species, not per run. Could you share a couple of single-line metadata files? I need to know how they should look, especially the file names.
Ah yeah, sorry! The number behind the $ needs to match the run column number. In a metadata sheet put out by metadata, it should be $16, I think? I rearranged the run column to the front in my original file.
Here they are for reference: they look like the big original file but have just a single entry, with the name RUN_ID.tsv:
updated_metadata.zip
Also, rename has to be installed via brew (brew install rename), and sed -i does not work on Mac either (macOS sed does not have sed -i for some reason). I installed GNU sed on my Mac via brew: brew install gnu-sed
Thank you. The rename command wasn't valid either, but it seems you've corrected it already. Good! So the last thing to fix is the sed command.
header="`cat run.tsv`"
gsed -i '1i '"${header}" ./*.tsv
rm run.tsv
Ah yeah, that's a lot cleaner than what I did. I manually copied the header line from the original metadata, changed the tab separator to ;;;, and used sed to change it back to tab once the header was in the target file. :D
It seems that I didn't tell you enough about the "inferred" option. This is not right:
if args.updated_metadata_dir == "inferred":
updated_metadata_dir = os.path.join(args.work_dir, args.updated_metadata_dir)
else:
updated_metadata_dir = os.path.realpath(args.updated_metadata_dir)
Because it produces a path like /path/to/gfe_data/inferred, which doesn't make sense. And this is what we wanted:
if args.updated_metadata_dir == "inferred":
updated_metadata_dir = os.path.join(args.work_dir, 'metadata/updated_metadata')
else:
updated_metadata_dir = os.path.realpath(args.updated_metadata_dir)
Do you have a test dataset to quickly confirm whether your changes work well? I'm wondering how you test new functionalities.
I usually use a dataset I'm currently working with to test these. Sometimes I have to create an artificial one, as described a few comments above (where I split the large metadata sheet into hundreds of artificial updated_metadata.tsv's).
During the last week, I could successfully test this with actual updated_metadata from ~3000 SRA samples (out of which ~2500 made it into curate). Updated metadata was properly created by getfastq and was read by curate and processed without issue.
At some point, we should find a minimal dataset only for testing. Continued here https://github.com/kfuku52/amalgkit/issues/41
I couldn't identify when num_read_fastp is added to the metadata table, and its lack caused an error in the curate step. Am I missing something?