iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License

Mapping/remapping of RNA-seq data #302

Closed faricazjj closed 2 years ago

faricazjj commented 2 years ago

Due to the questionable provenance of the alignment files previously uploaded to the public-facing server, we have decided to re-map all RNA-seq samples described in the summary table GSheet.

I have taken notes on the specific steps of this final re-alignment, which used nextflow (v21.10.6.5660, run from the environment created by apaeval_execution_workflows.yml) and nf-core/rnaseq (v3.8.1); the notes can be found in this GDoc. Below I briefly describe the annotations and the alignment options used.

Annotations / Genome Data used

Running nf-core/rnaseq

The command to run a dataset looks something like the following (for the mayr dataset):

nextflow run nf-core/rnaseq \
  --input /data/apaeval/nf_rnaseq/mayr_samplesheet.csv \
  --outdir /data/apaeval/nf_rnaseq/mayr/ \
  -profile docker \
  --aligner star_salmon \
  --save_reference \
  --gencode \
  --fasta /data/apaeval/genome_data/GRCh38.primary_assembly.genome.fa.gz \
  --gtf /data/apaeval/genome_data/gencode.v38.annotation.gtf.gz \
  --save_trimmed \
  --skip_markduplicates \
  --skip_stringtie \
  --save_unaligned \
  --skip_bbsplit
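For anyone re-running several datasets, here is a minimal sketch of a dry-run helper that prints the command for a given dataset name instead of executing it. The function name `build_rnaseq_cmd` is hypothetical, and the sketch assumes that every dataset follows the same `<dataset>_samplesheet.csv` and `<dataset>/` naming as the mayr example; the flags and paths are copied from the mayr command above.

```shell
# Hypothetical helper: print (do not run) the nf-core/rnaseq command for one dataset.
# Assumption: other datasets use the same path layout as the mayr example.
build_rnaseq_cmd() {
  dataset="$1"
  base=/data/apaeval/nf_rnaseq
  genome=/data/apaeval/genome_data
  # printf reuses the '%s ' format for every argument, giving one space-separated line.
  printf '%s ' \
    nextflow run nf-core/rnaseq \
    --input "$base/${dataset}_samplesheet.csv" \
    --outdir "$base/${dataset}/" \
    -profile docker \
    --aligner star_salmon \
    --save_reference \
    --gencode \
    --fasta "$genome/GRCh38.primary_assembly.genome.fa.gz" \
    --gtf "$genome/gencode.v38.annotation.gtf.gz" \
    --save_trimmed \
    --skip_markduplicates \
    --skip_stringtie \
    --save_unaligned \
    --skip_bbsplit
  printf '\n'
}

# Dry run for the mayr dataset; inspect the line before running it by hand.
build_rnaseq_cmd mayr
```

Printing rather than executing keeps the helper safe to run anywhere and leaves a record of the exact command line per dataset.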

All commands used to run the various datasets can be found within the GDoc starting here

All dataset sample sheets can be found in this GDrive directory

Progress on alignments of datasets

Alignment Quality

This should have been looked at before. Once the QC runs have finished and I have picked out the most relevant summaries and outputs, I will update this issue with links to the QC results and alignment quality metrics. MultiQC updates for each dataset are available in a shared GDrive directory here.

ninsch3000 commented 2 years ago

I confirm @dominikburri 's "anecdotal" observation, that samples seem to map okish to chr1, but on other chromosomes either don't map or totally mismatch. I looked at SRR6795718, SRR6795713 (both Mayr) and SRR1573494 (NOT Mayr, so the samples don't seem to be at fault).

For the re-mapped files, can we get some mapping statistics and quality reports? Are you using nf-core/rnaseq? They should have this kind of report, shouldn't they? If not, you might want to consider using a different pre-processing pipeline, as I think it is important (especially for a benchmarking paper) to provide provenance and quality information about the data used. Along the same lines, we need to know exactly which genome version/files have been used.
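As a quick per-chromosome sanity check alongside the pipeline's own reports, here is a minimal sketch that summarizes `samtools idxstats` output (tab-separated: reference name, length, mapped count, unmapped count). The function name `mapped_share` is hypothetical, and this is not taken from the workflow used in this thread.

```shell
# Hypothetical helper: read `samtools idxstats` output on stdin and print,
# for each reference, the mapped count and its percentage of all mapped reads.
# Typical use (requires samtools and an indexed BAM):
#   samtools idxstats sample.bam | mapped_share
mapped_share() {
  awk -F'\t' '
    $1 != "*" { order[++n] = $1; mapped[$1] = $3; total += $3 }  # skip the "*" line
    END {
      for (i = 1; i <= n; i++) {
        pct = (total > 0) ? 100 * mapped[order[i]] / total : 0
        printf "%s\t%d\t%.2f\n", order[i], mapped[order[i]], pct
      }
    }'
}
```

A sample where nearly all mapped reads land on a single chromosome would stand out immediately in the third column.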

mrgazzara commented 2 years ago

@ninsch3000 Yes, the problem was with the headers in the bam files. The provenance of the files uploaded to the public-facing server is questionable (we have moved between computing resources many times in the past year), so it is more than likely that the wrong set of files was grabbed.
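One way to catch this kind of header problem is to compare the contigs declared in a BAM header with the reference's FASTA index. The following is a minimal sketch, not the check that was actually run here: the function names are hypothetical, the header is expected as text (for example from `samtools view -H sample.bam > header.sam`), and the `.fai` file is assumed to come from `samtools faidx` on the same genome FASTA.

```shell
# Hypothetical helper: print "name<TAB>length" for every @SQ line of a SAM header on stdin.
sq_contigs() {
  awk -F'\t' '$1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)   # contig name
      if ($i ~ /^LN:/) len = substr($i, 4)    # contig length
    }
    print name "\t" len
  }'
}

# Hypothetical helper: succeed (exit 0) only if the header's contig names,
# lengths and order match the first two columns of the FASTA index.
#   $1 = header text file, $2 = reference .fai file
check_header_against_fai() {
  header_contigs="$(sq_contigs < "$1")"
  fai_contigs="$(cut -f1,2 "$2")"
  [ "$header_contigs" = "$fai_contigs" ]
}
```

Usage would look like `check_header_against_fai header.sam genome.fa.fai || echo "header does not match reference"`. Comparing order as well as names is deliberate, since a reordered contig list is enough to make reads appear on the wrong chromosome.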

I have been in the process of remapping everything with nf-core/rnaseq again. I am writing up a good description / documentation of exactly how things are run, etc. and will post a full description with links to the results above soon.

mrgazzara commented 2 years ago

Mapping is complete. I will upload and review the MultiQC results (which nicely collect many QC metrics in one spot) and update the bam files on the public-facing server tomorrow.

ninsch3000 commented 2 years ago

Awesome! Love the MultiQC reports @mrgazzara ! They're also really useful to expand the data section of the manuscript a little more, as we can nicely see how the different datasets cover different biotypes, genomic origins/read distributions, etc.

mrgazzara commented 2 years ago

Updates on QC:

Final thing to do before closing for now is wait for the uploads to finish.

mrgazzara commented 2 years ago

Upload is complete. Finalized bam files can be found here: https://majiq.biociphers.org/data/apaeval/bams/ with description of samples in the datasets GSheet RNA-seq tab.

Closing this issue!