Mapping/remapping of RNA-seq data

faricazjj commented 2 years ago

Due to questionable provenance of the alignment files uploaded previously to the publicly facing server, we have decided to re-map all RNA-seq samples described in the summary table GSheet.

I have taken notes on the specific steps of this final re-alignment, which used nextflow (v21.10.6.5660, run from the environment created by apaeval_execution_workflows.yml) and nf-core/rnaseq (v3.8.1); the notes can be found in this GDoc. Below I briefly describe the exact annotations and alignment options used.

Annotations / Genome Data used

Human RNA-seq (Mayr, HEK293, keratinocyte datasets): GENCODE release 38 -- GTF: comprehensive gene annotation on reference chromosomes only, FTP link: gencode.v38.annotation.gtf.gz -- Fasta: Genome sequence, primary assembly (GRCh38), FTP link: GRCh38.primary_assembly.genome.fa.gz
Human GTEx simulation data (GTEXsim dataset): GENCODE release 26 -- GTF: comprehensive gene annotation on reference chromosomes only, FTP link: gencode.v26.annotation.gtf.gz -- Fasta: Genome sequence, primary assembly (GRCh38), FTP link: GRCh38.primary_assembly.genome.fa.gz
Mouse RNA-seq (P19 and MmusCortex datasets): GENCODE release M18 -- GTF: comprehensive gene annotation on reference chromosomes only, FTP link: gencode.vM18.annotation.gtf.gz -- Fasta: Genome sequence, primary assembly (GRCm38), FTP link: GRCm38.primary_assembly.genome.fa.gz

Running nf-core/rnaseq

The command to run a dataset looks something like the following (for the mayr dataset):

nextflow run nf-core/rnaseq --input /data/apaeval/nf_rnaseq/mayr_samplesheet.csv --outdir /data/apaeval/nf_rnaseq/mayr/ -profile docker --aligner star_salmon --save_reference --gencode --fasta /data/apaeval/genome_data/GRCh38.primary_assembly.genome.fa.gz --gtf /data/apaeval/genome_data/gencode.v38.annotation.gtf.gz --save_trimmed --skip_markduplicates --skip_stringtie --save_unaligned --skip_bbsplit
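For anyone re-running several datasets, the per-dataset commands could be assembled programmatically rather than typed by hand. The following is a minimal Python sketch, not the procedure actually used here: only the Mayr paths and the flags come from the command in this issue. The `build_command` helper, the `ANNOTATIONS` keys, and the assumption that other datasets follow the same `<dataset>_samplesheet.csv` / `<dataset>/` layout are hypothetical.

```python
import shlex

# Genome FASTA and GTF per annotation set, as listed under
# "Annotations / Genome Data used". The dictionary keys are hypothetical labels.
ANNOTATIONS = {
    "human_v38": ("GRCh38.primary_assembly.genome.fa.gz", "gencode.v38.annotation.gtf.gz"),
    "human_v26": ("GRCh38.primary_assembly.genome.fa.gz", "gencode.v26.annotation.gtf.gz"),
    "mouse_vM18": ("GRCm38.primary_assembly.genome.fa.gz", "gencode.vM18.annotation.gtf.gz"),
}

# Flags copied from the Mayr command in this issue.
COMMON_FLAGS = [
    "-profile", "docker",
    "--aligner", "star_salmon",
    "--save_reference", "--gencode",
    "--save_trimmed", "--skip_markduplicates", "--skip_stringtie",
    "--save_unaligned", "--skip_bbsplit",
]


def build_command(dataset, annotation_key,
                  run_dir="/data/apaeval/nf_rnaseq",
                  genome_dir="/data/apaeval/genome_data"):
    """Return the nextflow command for one dataset as an argument list.

    Assumes (hypothetically) that every dataset uses the same directory
    layout as the Mayr example.
    """
    fasta, gtf = ANNOTATIONS[annotation_key]
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "--input", f"{run_dir}/{dataset}_samplesheet.csv",
        "--outdir", f"{run_dir}/{dataset}/",
        *COMMON_FLAGS,
        "--fasta", f"{genome_dir}/{fasta}",
        "--gtf", f"{genome_dir}/{gtf}",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for the Mayr dataset.
    print(shlex.join(build_command("mayr", "human_v38")))
```

The argument list can be passed to `subprocess.run` directly. Nextflow's `-r` option (for example `-r 3.8.1`) could be added to pin the pipeline revision, although the command recorded in this issue does not include it.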

All commands used to run the various datasets can be found within the GDoc starting here

All dataset sample sheets can be found in this GDrive directory
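For readers without access to the GDrive directory: nf-core/rnaseq 3.x expects a CSV sample sheet with the columns `sample`, `fastq_1`, `fastq_2` and `strandedness`. The sketch below shows one way such a file could be written in Python. The sample name and FASTQ paths are placeholders, not the actual APAeval entries, and `write_samplesheet` is a hypothetical helper.

```python
import csv
import io

# Column layout expected by nf-core/rnaseq 3.x sample sheets.
FIELDS = ["sample", "fastq_1", "fastq_2", "strandedness"]
VALID_STRANDEDNESS = {"forward", "reverse", "unstranded"}


def write_samplesheet(rows, handle):
    """Write rows (dicts keyed by FIELDS) to an open text handle as CSV.

    fastq_2 may be omitted for single-end samples and is then left empty.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["strandedness"] not in VALID_STRANDEDNESS:
            raise ValueError(f"invalid strandedness: {row['strandedness']!r}")
        # Default fastq_2 to an empty string; a value in `row` overrides it.
        writer.writerow({"fastq_2": "", **row})


if __name__ == "__main__":
    # Placeholder entry; real sample names and paths live in the sample sheets.
    example = [{
        "sample": "example_sample",
        "fastq_1": "/path/to/example_sample_1.fastq.gz",
        "fastq_2": "/path/to/example_sample_2.fastq.gz",
        "strandedness": "reverse",
    }]
    buffer = io.StringIO()
    write_samplesheet(example, buffer)
    print(buffer.getvalue(), end="")
```

Validating the `strandedness` column before launching the pipeline catches typos early.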

Progress on alignments of datasets

[x] Mayr immune cells (14 samples)
[x] GTEx simulations (20 samples)
[x] HEK293 hnRNPC knockdown (4 samples)
[x] keratinocyte differentiation (6 samples)
[x] P19 SR protein knockdowns (6 samples)
[x] Mouse Cortex (4 samples)

Alignment Quality

This should have been looked at before. I will update this issue with links to the QC and alignment quality reports once they have been run and I have looked through them for the most applicable summaries / outputs. MultiQC results for each dataset are available in a shared GDrive directory here.

[x] Upload and share MultiQC results
[x] Assess alignment/read quality metrics
[x] Upload new alignments to server

ninsch3000 commented 2 years ago

I confirm @dominikburri's "anecdotal" observation that samples seem to map okish to chr1, but on other chromosomes either don't map or totally mismatch. I looked at SRR6795718, SRR6795713 (both Mayr) and SRR1573494 (NOT Mayr, so the samples themselves don't seem to be at fault).

For the re-mapped files, can we get some mapping statistics and quality reports? Are you using nf-core/rnaseq? They should have these kinds of reports, shouldn't they? If not, you might want to consider using a different pre-processing pipeline, as I think it is important (especially for a benchmarking paper) to provide provenance and quality information about the data used. Along the same lines, we need to know exactly which genome version/files have been used.

mrgazzara commented 2 years ago

@ninsch3000 Yes, the problem was with the headers in the bam files. The provenance of the files uploaded to the public facing server is questionable (we have moved between computing resources many times in the past year), and it is more than likely that the wrong set of files was grabbed.
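A header problem of this kind can be caught by comparing the `@SQ` lines of a BAM header against the reference the reads were supposedly aligned to. Below is a minimal Python sketch of such a check, not the check that was actually run here. It assumes the header text has already been extracted (for example with `samtools view -H file.bam`) and that the expected contig lengths come from the reference FASTA index. The helper names and the example values are illustrative.

```python
def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def header_mismatches(header_text, expected):
    """List human-readable differences between a header and expected contigs."""
    observed = parse_sq_lines(header_text)
    problems = []
    for name, length in expected.items():
        if name not in observed:
            problems.append(f"{name}: missing from BAM header")
        elif observed[name] != length:
            problems.append(f"{name}: header length {observed[name]}, expected {length}")
    for name in observed:
        if name not in expected:
            problems.append(f"{name}: not in reference")
    return problems


if __name__ == "__main__":
    # Illustrative header with one matching and one mismatching contig length.
    header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:1000\n"
    expected = {"chr1": 248956422, "chr2": 242193529}
    for problem in header_mismatches(header, expected):
        print(problem)
```

An empty result means every contig name and length in the header agrees with the reference. Any entry in the list is a reason to distrust the file's provenance.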

I have been in the process of remapping everything with nf-core/rnaseq again. I am writing up a good description / documentation of exactly how things are run, etc. and will post a full description with links to the results above soon.

mrgazzara commented 2 years ago

Mapping is complete. I will upload and review the MultiQC results (MultiQC nicely collects many QC metrics in one spot) and update the bam files on the public facing server tomorrow.

ninsch3000 commented 2 years ago

Awesome! Love the MultiQC reports @mrgazzara ! They're also really useful to expand the data section of the manuscript a little more, as we can nicely see how the different datasets cover different biotypes, genomic origins/read distributions, etc.

mrgazzara commented 2 years ago

Updates on QC:

Mmus_Cortex data was flagged as having incorrect reverse strandedness, with the suggestion that it was forward. My manual check confirmed that the flag was correct: this dataset is in fact forward stranded. The dataset has been remapped, the strandedness column in the Sample Summary GSheet has been updated, and the MultiQC results for the correct forward stranded alignments are in the MultiQC results GDrive directory.
GTEXsim data was flagged as having incorrect forward strandedness, with the suggestion that it was unstranded. My checks suggest the data is forward stranded, and I am unsure why it was flagged. I ran the pipeline with forward, reverse, and unstranded settings, compared alignment biotypes, looked at stranded bigWig tracks, etc., and documented the results in a section of the Mapping GDoc. I am leaving it as forward stranded for now. It is possible that the simulated reads were generated in some unusual way and that some tools may complain about this, but we will raise that issue later, if necessary.
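The logic behind this kind of strandedness flag can be illustrated by comparing the fraction of reads that agree with the annotated gene strand against the fraction that oppose it. The Python sketch below is only an illustration: the thresholds are assumptions chosen for the example, not the values used by nf-core/rnaseq or its underlying QC tools, and both function names are hypothetical.

```python
def classify_strandedness(sense_fraction, antisense_fraction,
                          stranded_threshold=0.8, unstranded_tolerance=0.1):
    """Classify a library from the fractions of reads on each strand.

    sense_fraction: reads matching the annotated gene strand.
    antisense_fraction: reads on the opposite strand.
    Both thresholds are illustrative assumptions.
    """
    if sense_fraction >= stranded_threshold:
        return "forward"
    if antisense_fraction >= stranded_threshold:
        return "reverse"
    if abs(sense_fraction - antisense_fraction) <= unstranded_tolerance:
        return "unstranded"
    return "undetermined"


def check_declared(declared, sense_fraction, antisense_fraction):
    """Return (matches, inferred) for a strandedness declared in a sample sheet."""
    inferred = classify_strandedness(sense_fraction, antisense_fraction)
    return inferred == declared, inferred


if __name__ == "__main__":
    # A library declared as reverse but with mostly sense-strand reads is flagged.
    matches, inferred = check_declared("reverse", 0.95, 0.03)
    print(f"declared=reverse inferred={inferred} matches={matches}")
```

Libraries whose fractions fall between the thresholds come out as "undetermined". Those are the cases where manual checks such as stranded bigWig tracks are most useful.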

Final thing to do before closing for now is wait for the uploads to finish.

mrgazzara commented 2 years ago

Upload is complete. Finalized bam files can be found here: https://majiq.biociphers.org/data/apaeval/bams/ with description of samples in the datasets GSheet RNA-seq tab.

Closing this issue!

iRNA-COSI / APAeval

Mapping/remapping of RNA-seq data #302