kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

num_read_fastp #25

Closed kfuku52 closed 3 years ago

kfuku52 commented 3 years ago

I couldn't identify when num_read_fastp is added to the metadata table, and its lack caused an error in the curate step. Am I missing something?

kfuku52 commented 3 years ago

This column seems to be used in cstmm too, so should be generated before that.

Hego-CCTB commented 3 years ago

I just double checked the data I've been downloading recently and it's missing there as well. I remember num_read_fastp got added in the metadata.tsv output that is produced for each SRR ID, not the full table produced by amalgkit metadata.

It's missing there as well, though.

In curate num_read_fastp is used for mapping rate calculation, but it should not be part of cstmm.

kfuku52 commented 3 years ago

OK, I thought cross_species.R is a part of cstmm, but it's not actually. We need mapping rates, num_read_fastp, and other related info to be reported as a part of metadata table. We shouldn't overwrite the input metadata, so a new tsv should be created. The new table should also store other per-SRR stats that can only be calculated after processing fastq, e.g., TMM normalization factors, updated exclusion column (e.g., low mapping rate), and so on. Would you be able to work on this?

Hego-CCTB commented 3 years ago

Having briefly looked through the code, num_read_fastp should be saved somewhere in the variable/dataframe seq_summary, but isn't actually properly dumped into the new metadata.tsv. I will investigate this.

Hego-CCTB commented 3 years ago

OK, I thought cross_species.R is a part of cstmm, but it's not actually. We need mapping rates, num_read_fastp, and other related info to be reported as a part of metadata table. We shouldn't overwrite the input metadata, so a new tsv should be created. The new table should also store other per-SRR stats that can only be calculated after processing fastq, e.g., TMM normalization factors, updated exclusion column (e.g., low mapping rate), and so on. Would you be able to work on this?

cross_species.R is an obsolete debugging file, I will remove this file in the next update.

Since the user has the option to not use fastp during quant, we should also account for that possibilty as well. And the mapping rate is calculated with the original metadata stats.

kfuku52 commented 3 years ago

Since the user has the option to not use fastp during quant, we should also account for that possibilty as well.

This is right. You can pick num_read_XXX depending on its presence.

And the mapping rate is calculated with the original metadata stats.

This is not correct. The read numbers should be obtained from actual fastq. This is because there is no guarantee that all the reads have been extracted.

Hego-CCTB commented 3 years ago

Turns out it was a very straight forward solution. I can add any other fastp and fastq statistics as well, like dumped, rejected and written reads. Should I just add them all?

Also I would change curate to take the fastq_dump written reads for mapping rate calculation, if num_read_fastp isn't available, correct?

Hego-CCTB commented 3 years ago

Another thing: My current solution is to use the already implemented flag --save_metadata and output the statistics within a single-row metadata.tsv

But since this table contains statistics necessary for curate, even if fastp was turned off, I'd remove the argument --save_metadata entirely and just always output the single-row metadata.tsv. We can keep the option, of course and leave it on as default, but I can't think of any reason why someone would not need this output.

kfuku52 commented 3 years ago

Turns out it was a very straight forward solution. I can add any other fastp and fastq statistics as well, like dumped, rejected and written reads. Should I just add them all?

Yes, let's keep as many info as possible.

Also I would change curate to take the fastq_dump written reads for mapping rate calculation, if num_read_fastp isn't available, correct?

Yes, you are right.

But since this table contains statistics necessary for curate, even if fastp was turned off, I'd remove the argument --save_metadata entirely and just always output the single-row metadata.tsv. We can keep the option, of course and leave it on as default, but I can't think of any reason why someone would not need this output.

Please obsolete the option as it becomes a mandatory feature.

Hego-CCTB commented 3 years ago

I need one last clarification: num_read_unfiltered in curatewould be equivalent to the number of fastq dumped reads in getfastq, or something else entirely?

kfuku52 commented 3 years ago

Should be the same. I annotated this as num_read_unfiltered because there was another "filtering" step in the NC paper (filtering out reads mapped to mitochondria genomes), which is not present in amalgkit because of its limited effectiveness. You can rename the column.

Hego-CCTB commented 3 years ago

Changes

getfastq

getfastq output now looks like this:

Screenshot 2021-04-23 at 13 31 42

The metadata.tsv is a single line (plus header) for that particular ID. New statistics added to the metadata.tsv:

curate

If these changes are OK, I'll push the update.

Hego-CCTB commented 3 years ago

Idea to maybe add tomerge or another subfunctionality (maybe amalgkit gather), to collect and merge all the metadata.tsv`s from the getfastq output into a single file.

kfuku52 commented 3 years ago

I'm wondering where we should put the metadata single-row tsv files. These files will be accessed/updated by different subcommands, so wouldn't it make more sense to put them in a new, dedicated sub-directory under outdir/metadata/ rather than under outdir/getfastq/?

I'm thinking something like this:

outdir
 |
 +--metadata
      |
      +--metadata
      +--pivot_tables
      +--temp
      +--updated_metadata (<-THIS!)
            |
            +--metadata_SRR0000000.tsv
            +--metadata_SRR0000001.tsv
            +--metadata_SRR0000002.tsv
Hego-CCTB commented 3 years ago

yeah, that makes a lot of sense. And then the merge of all these .tsv"s is just a cat away to be used for curate too.

kfuku52 commented 3 years ago

curate should change the metadata, because there may be some excluded SRRs, which should be noted in the exclusion column. Merging all single-row tsv should be done in the very final step after curate, maybe amalgkit summary.

Hego-CCTB commented 3 years ago

In that case I'll need to also adjust curate to take the updated_metadata folder as additional input to get the num_read_fastp info.

Hego-CCTB commented 3 years ago

for mapping rate calculation

Hego-CCTB commented 3 years ago

and calculation of lost read %

kfuku52 commented 3 years ago

Yes, multiple subcommands including curate should share and use a function to load it into DataFrame. I guess it wouldn't take time even when 1000s rows should be concatenated.

Hego-CCTB commented 3 years ago

CHANGES

amalgkit getfastq

amalgkit curate

kfuku52 commented 3 years ago

This join would be confusing. --updated_metadata_dir should receive a valid path, not a truncated path that is valid only when joined with args.work_dir. Same applies to https://github.com/kfuku52/amalgkit/issues/24.

updated_metadata_dir = os.path.join(args.work_dir, args.updated_metadata_dir)

Also, please change to required=False. You can set arbitrary default value such as "infer" and detect it later to replace it with as.path.join(args.work_dir, 'metadata/updated_metadata/'). Please see --iqtree_treefile in csubst for example.

pcu.add_argument('--updated_metadata_dir', metavar='PATH', default='./metadata/updated_metadata/', type=str, required=True, action='store',
help='default=%(default)s: Directory containing metadata.tsv files as put out by amalgkit getfastq.')
Hego-CCTB commented 3 years ago

Just FYI, some quick testing make it seem that os.path.join(args.work_dir, args.updated_metadata) is smart enough to resolve the correct path: Here, I give absolute paths to all input files: amalgkit curate --work_dir /Users/s229181/MSN --metadata /Users/s229181/MSN/metadata/Metadata_all_2.tsv --infile /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv --eff_len_file /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv --tissues 'root,flower,leaf' --norm fpkm --cleanup 1 --updated_metadata_dir /Users/s229181/MSN/metadata/updated_metadata/

the R-script reports (ommiting some other input parameters):

working_directory = /Users/s229181/MSN
metadata:  /Users/s229181/MSN/metadata/Metadata_all_2.tsv 
infile:  /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv 
gene length file:  /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv 
updated metadata directory:  /Users/s229181/MSN/metadata/updated_metadata 

The same code, but I only give the truncated paths, starting from the working directory: amalgkit curate --work_dir /Users/s229181/MSN/ --metadata metadata/Metadata_all_2.tsv --infile counts/Citrullus_lanatus_est_counts.tsv --eff_len_file eff_length/Citrullus_lanatus_eff_length.tsv --tissues 'root,flower,leaf' --norm fpkm --cleanup 1 --updated_metadata_dir metadata/updated_metadata/

Rscript reports:

working directory:  /Users/s229181/MSN 
metadata:  /Users/s229181/MSN/metadata/Metadata_all_2.tsv 
infile:  /Users/s229181/MSN/counts/Citrullus_lanatus_est_counts.tsv 
gene length file:  /Users/s229181/MSN/eff_length/Citrullus_lanatus_eff_length.tsv 
updated metadata directory:  /Users/s229181/MSN/metadata/updated_metadata

I do concede that it might be confusing, but whatever the user decides should still be working as intended.

kfuku52 commented 3 years ago

The coding should be done in a way that prevents possible errors in the future. Please imagine the cases like this, in which os.path.join() failed to handle nested directories with the same name. https://stackoverflow.com/questions/44022916/combine-two-paths-with-common-folder

Hego-CCTB commented 3 years ago

Ah, yes, that makes sense. I did not consider this. Will adjust the code accordingly!

Hego-CCTB commented 3 years ago

Changed the pathing default to "inferred". If inferred, path is set to workdir/metadata/updated_metadata/, otherwise the full path the user provided.

I pushed the update to master, so the updated_metadata input from getfastq is now required. For testing, I split my large metadata table from metadata into single line metadata files with:

 awk -F $'\t' '{print>$1;close($1)}' metadata.tsv
 rename 's/$/\.tsv/' *

then you have to add the header of the original metadata.tsv to all the small ones with: sed -i '1i \copy_and_paste_header_here' ./*.tsv

  # better way:
  header="`cat run.tsv`"
  gsed -i '1i '"${header}" ./*.tsv
  rm run.tsv

these should be in the /metadata/updated_metadata/ folder and should be in the correct format. Note: since the header contains tabs, it might be necessary to replace those in a different file with a different separator, add them to the .tsv's and then reintroduce the tab-separator.

EDIT: FIXED THE COMMANDS ABOVE

kfuku52 commented 3 years ago

Thank you. Please close the issue if everything looks fine.

Hego-CCTB commented 3 years ago

fixed the rest of the input files as well. All curate inputs require the full path (instead of work_dir + truncated path)

kfuku52 commented 3 years ago

The above awk command wasn't valid on my mac. It worked when {...} was single-quoted, but the result looks not correct. Files were generated per-species, not per-run. Could you share a couple of single-line metadata files? I need to know how it should look like, especially in file names.

Hego-CCTB commented 3 years ago

ah yeah, sorry! The number behind the $ need to match the run column number. In a metadata sheet put out by metadata it should be $16, I think? I rearranged the run column to the front in my original file. Here is for reference, they look like the big, original file, but have just a single entry, with the name RUN_ID.tsv: updated_metadata.zip

Hego-CCTB commented 3 years ago

also, rename (brew install rename)has to be installed via brew and sed -i does not work with mac either (mac sed does not have sed -i for some reason. I installed GNU sed on my mac via brew: brew install gnu-sed)

kfuku52 commented 3 years ago

Thank you. The rename command wasn't valid too but it seems you've corrected already. Good! So the last thing to fix is the sed command.

header="`cat run.tsv`"
gsed -i '1i '"${header}" ./*.tsv
rm run.tsv
Hego-CCTB commented 3 years ago

ah yeah, that's a lot cleaner than what I did. I manually copied the header line from the original metadata, changed the tab seperator to ;;; and used sed to change it back to tab once the header was in the target file. :D

kfuku52 commented 3 years ago

It seems that I didn't tell you enough about the "inferred" option. This is not right:

    if args.updated_metadata_dir == "inferred":
        updated_metadata_dir = os.path.join(args.work_dir, args.updated_metadata_dir)
    else:
        updated_metadata_dir = os.path.realpath(args.updated_metadata_dir)

Because it produces a PATH like /path/to/gfe_data/inferred, which doesn't make sense. And this is what we wanted:

    if args.updated_metadata_dir == "inferred":
        updated_metadata_dir = os.path.join(args.work_dir, 'metadata/updated_metadata')
    else:
        updated_metadata_dir = os.path.realpath(args.updated_metadata_dir)

Do you have a test dataset to quickly confirm whether your changes work well? I'm wondering how you test new functionalities.

kfuku52 commented 3 years ago

fixed in https://github.com/kfuku52/amalgkit/commit/b8bed6f1deccdda7c4f2e9e1c4d38e37bb9b7a69

Hego-CCTB commented 3 years ago

I usually use a dataset I'm currently working with to test these. Sometimes I have to create an artificial one, like I did as described a few comments above (where I split the large metadata sheet into hundreds of artificial updated_metadata.tsv's). During the last week, I could successfully test this with actual updated_metadata from ~3000 SRA samples (out of which ~2500 made it into curate). Updated metadata was properly created by getfastq and were read by curate and processed without issue.

kfuku52 commented 3 years ago

At some point, we should find a minimal dataset only for testing. Continued here https://github.com/kfuku52/amalgkit/issues/41