Closed BirongZhang closed 1 year ago
Hey there, unfortunately, the workflow has been designed to work with paired-end FASTQ files.
Hi Barry, thanks so much for letting me know that.
I'm looking at the experiment metadata for SRR6343608. It states the data is PAIRED? Maybe you did not download the data correctly.
Might I suggest using nf-core/fetchngs, which is an excellent workflow for downloading public datasets. Simply provide the workflow with an SRA / ENA / GEO ID, so for you it would be PRJNA420975 for the first sample. You will have to go digging for the rest.
Good luck!
Hey Barry,
Thanks for your kind help! Let me check my data. I have only downloaded a few samples of this dataset and would like to try them out first. However, I still have a lot of single-end FASTQ data...
Thanks again!
Kind regards, Birong
Hi Barry,
Thanks for letting me know about the useful pipeline nf-core/fetchngs! It works well!
You are right, it is a paired-end dataset: SRR6343628. I don't know why my previous method didn't work.
nohup parallel -j 1 fastq-dump --skip-technical -F ::: $(cat SraAccList.txt)
Last week I tried the same data set: PRJNA420975 with nf-core/fetchngs pipeline, but I was a little confused. Perhaps this data (SRX3441728) is large and the pipeline splits the data into two parts. So when I want to merge them, should I merge them as shown below?
Thanks again for your time and work!
Kind regards, Birong
Hey Birong,
Yep, that didn't work because you need to include the --split-3 option in your fastq-dump command. This will split the mate pairs into *_1.fastq and *_2.fastq files for you. But I see you got nf-core/fetchngs working :)
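For anyone who wants to stay with fastq-dump, here is a minimal sketch of how the per-accession commands could be assembled with --split-3 added. It keeps the flags from the command quoted above; the function name `fastq_dump_commands` is made up for illustration, and the accession list is assumed to come from a file such as SraAccList.txt.

```python
from typing import Iterable, List


def fastq_dump_commands(accessions: Iterable[str]) -> List[List[str]]:
    """Build one fastq-dump argv per accession (hypothetical helper).

    --split-3 writes paired mates to *_1.fastq and *_2.fastq;
    --skip-technical and -F are carried over from the original command.
    """
    commands = []
    for accession in accessions:
        accession = accession.strip()
        if not accession:
            continue  # skip blank lines in the accession list
        commands.append(
            ["fastq-dump", "--split-3", "--skip-technical", "-F", accession]
        )
    return commands


if __name__ == "__main__":
    # Print the commands; each argv could be passed to subprocess.run(cmd, check=True).
    for cmd in fastq_dump_commands(["SRR6343628"]):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to inspect them before running anything on the cluster.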
Your merge strategy looks correct to me. Judging by the file sizes, they might have split SRX3441728 over two lanes to increase sequencing depth, in which case merging makes sense.
However, just to be safe, run FastQC on the SRX3441728 samples to make sure one of the lanes wasn't a bad batch.
Also, if you merge the files, check to make sure that the *_1.fastq and *_2.fastq mate files have the same number of reads. (I am pretty sure I have come across this error with the aligners in this workflow).
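A rough sketch of the merge-then-check idea, assuming gzipped FASTQ files with the usual four lines per read. The function names are illustrative only and are not part of the workflow. Concatenating gzip files byte-for-byte yields a valid gzip file, so no decompression is needed for the merge itself.

```python
import gzip
import shutil
from typing import Iterable


def merge_gz(parts: Iterable[str], merged: str) -> None:
    """Concatenate gzipped FASTQ parts into one file.

    Use the same part order for the _1 and _2 files so mates stay in sync.
    """
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz: str) -> int:
    """Count reads, assuming four lines per FASTQ record."""
    with gzip.open(fastq_gz, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{fastq_gz}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(mate1: str, mate2: str) -> bool:
    """True when both mate files contain the same number of reads."""
    return count_reads(mate1) == count_reads(mate2)
```

If `mates_match` returns False after merging, it is worth checking each lane's files separately before handing anything to the aligners.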
Best, Barry
Hi Barry,
Sorry, it is me, again!
Thanks for your reply! They are all working now. I've successfully downloaded several datasets!
But I have a new problem with the circrna pipeline.
I am using the supercomputer Hawk, and paired data. Here is my script:
module load nextflow/21.04.0
module load singularity
nextflow run nf-core/circrna \
-r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
-profile singularity \
--genome 'GRCh37' \
--input "raw_data/SRR6343628_{1,2}.fastq.gz" \
--input_type 'fastq' \
--module 'circrna_discovery' \
--tool 'circexplorer2'
Here is my error:
Error executing process > 'STAR_1PASS (SRR6343628)'
Caused by: Process requirement exceed available CPUs -- req: 16; avail: 8
What does this mean? Does this mean the supercomputer doesn't meet the pipeline's requirements? But when I run the test data, I successfully get the results (test_outdir). What should I do?
Let me know if you need any further information. Thanks so much for your time and patience!
Best regards, Birong
Hi Birong,
Don't worry about it - happy to help.
So this means that the process STAR_1PASS requested 16 CPUs, but you only have 8 CPUs available on the queue you sent the job to on Hawk. You will need to change the configuration file settings. Try the following:
1. Edit the resource requests in the conf/base.config file of your fork.
2. Pull and run your fork: nextflow pull BirongZhang/circrna, then nextflow run BirongZhang/circrna -r dev [...]
Here is what I mean by editing conf/base.config:
/*
 * -------------------------------------------------
 *  nf-core/circrna Nextflow base config file
 * -------------------------------------------------
 * A 'blank slate' config file, appropriate for general
 * use on most high performance compute environments.
 * Assumes that all software is installed and available
 * on the PATH. Runs in `local` mode - all jobs will be
 * run on the logged in environment.
 */
process {
  cpus = { check_max( 1 * task.attempt, 'cpus' ) }
  memory = { check_max( 7.GB * task.attempt, 'memory' ) }
  time = { check_max( 12.h * task.attempt, 'time' ) }
  errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
  maxRetries = 1
  maxErrors = '-1'
  // Process-specific resource requirements
  // NOTE - Only one of the labels below are used in the fastqc process in the main script.
  // If possible, it would be nice to keep the same label naming convention when
  // adding in your processes.
  // TODO nf-core: Customise requirements for specific processes.
  // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
  withLabel:process_low {
    cpus = { check_max( 2 * task.attempt, 'cpus' ) } // This line controls CPU usage for process_low labels
    memory = { check_max( 14.GB * task.attempt, 'memory' ) }
    time = { check_max( 6.h * task.attempt, 'time' ) }
  }
  withLabel:process_medium {
    cpus = { check_max( 8 * task.attempt, 'cpus' ) } // If your CPU max is 8, set this to 4?
    memory = { check_max( 42.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
  }
  withLabel:process_high {
    cpus = { check_max( 16 * task.attempt, 'cpus' ) } // Alignment steps use process_high and request 16 CPUs. Change this to 8 CPUs
    memory = { check_max( 84.GB * task.attempt, 'memory' ) }
    time = { check_max( 16.h * task.attempt, 'time' ) }
  }
  withLabel:process_long {
    time = { check_max( 24.h * task.attempt, 'time' ) }
  }
  withName:get_software_versions {
    cache = false
  }
  withLabel:py3 {
    container = 'barryd237/py3:dev'
  }
}
You could ask your system administrator about the maximum CPU and memory capacity of Hawk so you can configure this file in such a way that it never asks for more resources than are available.
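To make the arithmetic behind those cpus lines concrete, here is a small Python sketch of the capping idea: each label requests base x attempt CPUs, and the request is clipped at a configured ceiling. This is only a model of the behaviour, not the pipeline's actual check_max code, and the ceiling of 8 is an assumption taken from the error message. If the pipeline exposes a max_cpus parameter alongside max_memory and max_time (not confirmed in this thread), lowering it should have the same capping effect without editing the labels.

```python
def check_max(requested: int, maximum: int) -> int:
    """Cap a resource request at the configured maximum (simplified model)."""
    return min(requested, maximum)


def cpus_for(label_base: int, attempt: int, max_cpus: int) -> int:
    """CPUs a process would request on a given attempt.

    label_base: the number in the config, e.g. 16 for process_high.
    attempt:    1 on the first try, 2 after one retry.
    max_cpus:   assumed ceiling of the queue (8 in the error above).
    """
    return check_max(label_base * attempt, max_cpus)


if __name__ == "__main__":
    for label, base in [("process_low", 2), ("process_medium", 8), ("process_high", 16)]:
        print(label, cpus_for(base, attempt=1, max_cpus=8))
```

With a ceiling of 8, process_high would request 8 CPUs instead of 16, which is the request the scheduler rejected.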
@BirongZhang Going to re-open this issue because it has a lot of good troubleshooting questions in it - if that's ok?
Hi Barry,
Thanks so much for your kind help!
I will try what you suggested, and ask the Hawk team about the maximum CPU.
I will let you know what happens. Thanks again.
Best, Birong
Hi Barry,
I am back. I can now use another highmem partition, so maybe the previous problem could be solved. But this time, a new problem emerged before the STAR step:
N E X T F L O W ~ version 21.04.0
Launching `nf-core/circrna` [prickly_gautier] - revision: 892b3136e7432221bd81f8c7cc0400ebe541b08e
WARNING: Could not load nf-core/config profiles: https://raw.githubusercontent.com/nf-core/configs/master/nfcore_custom.config
WARN: There's no process matching config selector: get_software_versions
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/circrna v1.0.0
------------------------------------------------------
Input data log info:
No input sample CSV file provided, attempting to read from path instead.
Reading input data from path: raw_data/SRR6343609_{1,2}.fastq.gz
Core Nextflow options
runName : prickly_gautier
containerEngine : singularity
container : barryd237/circrna:dev
launchDir : /scratch/c.c2050857/NAFLD/GSE107650
workDir : /scratch/c.c2050857/NAFLD/GSE107650/work
projectDir : /home/c.c2050857/.nextflow/assets/nf-core/circrna
userName : c.c2050857
profile : singularity
configFiles : /home/c.c2050857/.nextflow/assets/nf-core/circrna/nextflow.config
Input/output options
input : raw_data/SRR6343609_{1,2}.fastq.gz
input_type : fastq
outdir : GSE107650_Results
Reference genome files
genome : GRCh37
STAR
chimScoreSeparation : 10
Generic options
max_multiqc_email_size: 25 MB
Max job request options
max_memory : 128 GB
max_time : 10d
------------------------------------------------------
Only displaying parameters that differ from defaults.
------------------------------------------------------
WARN: Access to undefined parameter `name` -- Initialise it to a default value eg. `params.name = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
WARN: Access to undefined parameter `gtf` -- Initialise it to a default value eg. `params.gtf = some_value`
WARN: Access to undefined parameter `bowtie` -- Initialise it to a default value eg. `params.bowtie = some_value`
WARN: Access to undefined parameter `bowtie2` -- Initialise it to a default value eg. `params.bowtie2 = some_value`
WARN: Access to undefined parameter `bwa` -- Initialise it to a default value eg. `params.bwa = some_value`
WARN: Access to undefined parameter `fasta_fai` -- Initialise it to a default value eg. `params.fasta_fai = some_value`
WARN: Access to undefined parameter `hisat` -- Initialise it to a default value eg. `params.hisat = some_value`
WARN: Access to undefined parameter `star` -- Initialise it to a default value eg. `params.star = some_value`
WARN: Access to undefined parameter `segemehl` -- Initialise it to a default value eg. `params.segemehl = some_value`
[- ] process > SOFTWARE_VERSIONS -
[- ] process > BWA_INDEX -
[- ] process > SAMTOOLS_INDEX -
[- ] process > HISAT2_INDEX -
[- ] process > STAR_INDEX -
[- ] process > BOWTIE_INDEX -
[- ] process > BOWTIE2_INDEX -
[- ] process > SEGEMEHL_INDEX -
[- ] process > FILTER_GTF -
[- ] process > CIRIQUANT_YML -
[- ] process > GENE_ANNOTATION -
[- ] process > BAM_TO_FASTQ -
[- ] process > FASTQC_RAW -
[- ] process > BBDUK -
[- ] process > FASTQC_BBDUK -
[- ] process > CIRIQUANT -
[- ] process > STAR_1PASS -
[- ] process > SJDB_FILE -
WARN: Access to undefined parameter `circexplorer2_annotation` -- Initialise it to a default value eg. `params.circexplorer2_annotation = some_value`
executor > local (2)
[f6/4d3ce9] process > SOFTWARE_VERSIONS [ 0%] 0 of 1
[- ] process > BWA_INDEX -
[- ] process > SAMTOOLS_INDEX -
[- ] process > HISAT2_INDEX -
[- ] process > STAR_INDEX -
[- ] process > BOWTIE_INDEX -
[- ] process > BOWTIE2_INDEX -
[- ] process > SEGEMEHL_INDEX -
[- ] process > FILTER_GTF -
[- ] process > CIRIQUANT_YML -
[- ] process > GENE_ANNOTATION -
[- ] process > BAM_TO_FASTQ -
[e3/10de03] process > FASTQC_RAW (SRR6343609) [ 0%] 0 of 1
[- ] process > BBDUK -
[- ] process > FASTQC_BBDUK -
[- ] process > CIRIQUANT -
[- ] process > STAR_1PASS -
[- ] process > SJDB_FILE -
[- ] process > STAR_2PASS -
[- ] process > CIRCEXPLORER2 -
[- ] process > CIRCRNA_FINDER -
[- ] process > DCC_MATE1 -
[- ] process > DCC_MATE2 -
[- ] process > DCC -
[- ] process > FIND_ANCHORS -
[- ] process > FIND_CIRC -
[- ] process > MAPSPLICE_ALIGN -
[- ] process > MAPSPLICE_PARSE -
[- ] process > SEGEMEHL_ALIGN -
[- ] process > ANNOTATION -
[- ] process > FASTA -
[- ] process > COUNT_MATRIX_SINGLE -
[- ] process > TARGETSCAN_DATABASE -
[- ] process > MIRNA_PREDICTION -
[- ] process > MIRNA_TARGETS -
[- ] process > HISAT_ALIGN -
[- ] process > STRINGTIE -
[- ] process > DEA -
[- ] process > MULTIQC -
WARN: Access to undefined parameter `circexplorer2_annotation` -- Initialise it to a default value eg. `params.circexplorer2_annotation = some_value`
executor > local (2)
[f6/4d3ce9] process > SOFTWARE_VERSIONS [100%] 1 of 1 ✔
[- ] process > BWA_INDEX -
[- ] process > SAMTOOLS_INDEX -
[- ] process > HISAT2_INDEX -
[- ] process > STAR_INDEX -
[- ] process > BOWTIE_INDEX -
[- ] process > BOWTIE2_INDEX -
[- ] process > SEGEMEHL_INDEX -
[- ] process > FILTER_GTF -
[- ] process > CIRIQUANT_YML -
[- ] process > GENE_ANNOTATION -
[- ] process > BAM_TO_FASTQ -
[e3/10de03] process > FASTQC_RAW (SRR6343609) [ 0%] 0 of 1
[- ] process > BBDUK -
[- ] process > FASTQC_BBDUK -
[- ] process > CIRIQUANT -
[- ] process > STAR_1PASS -
[- ] process > SJDB_FILE -
[- ] process > STAR_2PASS -
[- ] process > CIRCEXPLORER2 -
[- ] process > CIRCRNA_FINDER -
[- ] process > DCC_MATE1 -
[- ] process > DCC_MATE2 -
[- ] process > DCC -
[- ] process > FIND_ANCHORS -
[- ] process > FIND_CIRC -
[- ] process > MAPSPLICE_ALIGN -
[- ] process > MAPSPLICE_PARSE -
[- ] process > SEGEMEHL_ALIGN -
[- ] process > ANNOTATION -
[- ] process > FASTA -
[- ] process > COUNT_MATRIX_SINGLE -
[- ] process > TARGETSCAN_DATABASE -
[- ] process > MIRNA_PREDICTION -
[- ] process > MIRNA_TARGETS -
[- ] process > HISAT_ALIGN -
[- ] process > STRINGTIE -
[- ] process > DEA -
[- ] process > MULTIQC -
Error executing process > 'STAR_1PASS (null)'
Caused by:
Connect to ngi-igenomes.s3.amazonaws.com:443 [ngi-igenomes.s3.amazonaws.com/52.218.112.154] failed: Network is unreachable (connect failed)
-[nf-core/circrna] Pipeline completed with errors-
WARN: Killing pending tasks (1)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Have you come across this before? Let me know if you need more details, thanks.
Best, Birong
Hey Birong,
It looks like you do not have an internet connection on the cluster. Try pinging Google from the cluster; the result should look like this:
barry@YT-1300:/data$ ping www.google.com
PING www.google.com(di-in-f106.1e100.net (2a00:1450:400b:c01::6a)) 56 data bytes
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=1 ttl=110 time=55.6 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=2 ttl=110 time=132 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=3 ttl=110 time=43.7 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 43.667/77.093/132.042/39.157 ms
If you can locate the reference genome files you need (GRCh37 FASTA, GTF files [previous runs on your laptop maybe?]) and upload them to the cluster manually, you will not need to connect to the AWS iGenomes bucket to automatically pull reference files.
Then I can look into running the pipeline 'offline' for you - I've never done it but can try to learn
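Since ping uses ICMP while the failure in the log is an HTTPS connection to port 443, a TCP check against the same host may be more telling. Below is a minimal sketch using only the Python standard library; `can_connect` is a made-up helper name, and the host is the one quoted in the error message.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    DNS failures, refused connections, timeouts and "network is
    unreachable" are all OSError subclasses, so they all report False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "ngi-igenomes.s3.amazonaws.com"  # host from the Nextflow error
    print(host, "reachable" if can_connect(host) else "NOT reachable")
```

Running this on both the login node and a compute node of the partition should show whether only some partitions are blocked by the firewall.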
Hi Barry,
I am back again! So sorry for the delay. I had a break.
Yes, you are right. The supercomputer team also told me that I am sometimes not allowed to download external data because of the firewall. This also reminds me that sometimes I cannot even use wget in some supercomputer partitions.
I really appreciate your offer of "offline" help, but I don't think I should take up any more of your time and energy because of my particular case. You have done enough for me, and I really learned a lot from our conversation.
No worries: while I was trying to use your pipeline, I generated some STAR junction files, so next I will try the circular RNA tools one by one.
Nice to meet you online! Thanks so much for your kind help all the time!
Best, Birong
Hi Barry,
I am back again! I saw you also used DCC. When I was using DCC, I got an error. Could you help me by taking a quick look?
Here is my script:
# http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
sed -i '' 's/^chr//g' GRCh38_repeat_file.gtf
head -3 GRCh38_repeat_file.gtf
# Preparation of input files for circRNA detection step
# step one: obtain reference genome: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/
# step two: repeat masker file for the genome build: GRCh38_repeatmasker.gtf.gz
http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
https://www.biostars.org/p/227979/
DCC samplesheet \
-mt1 meta1 \
-mt2 meta2 \
-D \
-R GRCh38_repeat_file.gtf \
-an GRCh38_repeatmasker.gtf \
-Pi \
-F \
-M \
-Nr 5 6 \
-fg \
-G \
-O DCC \
-A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa
Here is my STAR output:
find -L Results/data/sample -name "*_Chimeric.out.junction" > samplesheet
find -L Results/data/sample_1 -name "*_1_Chimeric.out.junction" > meta1
find -L Results/data/sample_2 -name "*_2_Chimeric.out.junction" > meta2
head -3 samplesheet
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036334/SRR9036334_2_Chimeric.out.junction
$ head -3 meta1
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036334/SRR9036334_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036315/SRR9036315_1_Chimeric.out.junction
$ head -3 meta2
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036334/SRR9036334_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036315/SRR9036315_2_Chimeric.out.junction
Here is my error:
DCC 0.5.0 started
44 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
WARNING: File meta2, line 1 does not contain all features.
WARNING: meta2 is probably corrupt.
WARNING: Offending line: /scratch/c.c2050857/NAFLD/GSE130970/Results/R2/SRR9036381_2_Chimeric.out.junction
Traceback (most recent call last):
File "/nfshome/store03/users/c.c2050857/.venv-circtools-detect/bin/DCC", line 11, in <module>
load_entry_point('DCC==0.5.0', 'console_scripts', 'DCC')()
File "build/bdist.linux-x86_64/egg/DCC/main.py", line 254, in main
File "build/bdist.linux-x86_64/egg/DCC/main.py", line 535, in fixall
File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 94, in fixchimerics
File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 65, in fixmate2
IndexError: list index out of range
Is there anything wrong with my script or the input? Thanks!
Best regards, Birong
Hey Birong,
So one or two things that might help (but it is hard to tell from the output):
1. The -an argument should be the GTF file that was used to build the STAR index and for STAR mapping. You have provided the repeat_masker file to -an, which should be the same GTF used for STAR, i.e. the full GRCh38 GTF file.
2. Put an @ symbol in front of samplesheet, meta1 and meta2.
3. If DCC still complains about a *Chimeric.out.junction file, then it might really be corrupted. Here is an example of one I have on my computer:
chr1 1767875 - chr1 9242065 + 2 1 4 simulate:21663 1767876 76S24M 9242066 24S76M1595p73M18828N27M
chr1 9243810 + chr1 11026892 + 0 0 0 simulate:21971 9242109 100M1519p82M18S 11026893 82S16M2S
chr1 15450870 + chr1 15447467 + 2 1 2 simulate:33533 15450824 46M54S 15447468 46S54M3206p100M
chr1 15450870 + chr1 15447467 + 2 1 2 simulate:33535 15447538 18M3168N82M48p16M84S 15447468 16S82M2S
chr1 15450870 + chr1 15447467 + 2 1 2 simulate:33540 15450832 38M62S 15447468 38S62M10p16M3168N84M
chr1 15450870 + chr1 15447467 + 2 1 2 simulate:33547 15447485 71M3168N27M2S69p50M50S 15447468 50S50M
chr1 15447467 - chr1 15450870 - 1 2 1 simulate:33549 15447468 59S39M2S 15450749 100M-38p59M41S
chr1 15450855 + chr1 15447532 + -1 0 0 simulate:33558 15450755 100M 15447533 23M3168N77M
chr1 15450870 + chr1 15447467 + 2 1 2 simulate:33559 15450838 32M68S 15447468 32S68M-14p34M3168N66M
chr1 15447475 - chr1 15450754 - -1 0 0 simulate:33560 15447476 80M3168N20M 15447486 70M3168N30M
(14 columns).
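Here is a quick sketch for checking a junction file against that layout. The 14-column expectation is taken from the example above and is an assumption for other STAR versions, which may write additional columns; `short_lines` is an illustrative name. Note that the "offending line" in the DCC warning is a file path rather than a junction record, which is consistent with DCC parsing the list file itself as a junction file.

```python
from typing import Iterable, List, Tuple

EXPECTED_COLUMNS = 14  # from the example above; other STAR versions may differ


def short_lines(
    lines: Iterable[str], expected: int = EXPECTED_COLUMNS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every non-empty line with too few fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        if len(text.split()) < expected:
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for number, text in short_lines(handle):
            print(f"line {number} has too few columns: {text}")
```

Running it on meta2 itself would flag every line, because that file holds paths and not junction records.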
Good luck,
Barry
Hi Barry,
Thanks so much for your kind reply! It helps a lot!🥳
Do you mean this? How about -R GRCh38_repeat_file.gtf and -B bam_file.txt ? Do you have any suggestions about them?
DCC @samplesheet \
-mt1 @meta1 \
-mt2 @meta2 \
-D \
-R GRCh38_repeat_file.gtf \
-an /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.103.gtf \
-Pi \
-F \
-M \
-Nr 5 6 \
-fg \
-G \
-O DCC \
-A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa
Before that, I put all the STAR output into one big directory. Today, I tried putting each SRR's STAR output files into its own SRR directory. So now samplesheet has 158 lines, and meta1 and meta2 each have 78 lines. Is that okay? I am really confused about how to prepare those input files.😣
sample => samplesheet (_1 and _2, 158 lines)
sample_1 => meta1 (only _1, 78 lines)
sample_2 => meta2 (only _2, 78 lines)
Let me try it first, thanks again!🤗
Kind regards, Birong
The way I designed DCC in my workflow is to use the outputs from STAR using the 2nd pass mode.
1. Map every sample with STAR (1st pass).
2. Collect the sj.out.tab files for every sample mapped in 1st pass (these are novel junction sites).
3. Run STAR 2nd pass mapping, where I include the sj.out.tab files to help STAR align to novel splice sites. This is done for A: paired end reads, and B: each read individually.
4. Pass the resulting Chimeric.out.junction files to DCC. Using SRR9036307 as an example, DCC expects SRR9036307_Chimeric.out.junction, SRR9036307_1_Chimeric.out.junction and SRR9036307_2_Chimeric.out.junction as inputs. In the workflow, for sample SRR9036307, there are 3 inputs:
SRR9036307/SRR9036307_Chimeric.out.junction
mate1/SRR9036307_1_Chimeric.out.junction
mate2/SRR9036307_2_Chimeric.out.junction
The printf command is simply placing these $PATHS in samplesheet, mate1 and mate2 files for DCC - nothing special.
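As a rough illustration of that file layout, the three list files could be generated like this. This is a sketch under the directory structure described above, not the workflow's own code; `dcc_input_lists` and `write_lists` are hypothetical helpers. Each list gets exactly one line per sample, in the same order.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def dcc_input_lists(star_dir: str, samples: Iterable[str]) -> Dict[str, List[str]]:
    """Build the samplesheet, mate1 and mate2 path lists for DCC.

    Assumes the layout <star_dir>/<sample>/ for the paired mapping and
    <star_dir>/mate1/, <star_dir>/mate2/ for the per-mate mappings.
    """
    root = Path(star_dir)
    lists: Dict[str, List[str]] = {"samplesheet": [], "mate1": [], "mate2": []}
    for sample in sorted(samples):  # sorting keeps the three files in step
        lists["samplesheet"].append(
            str(root / sample / f"{sample}_Chimeric.out.junction"))
        lists["mate1"].append(
            str(root / "mate1" / f"{sample}_1_Chimeric.out.junction"))
        lists["mate2"].append(
            str(root / "mate2" / f"{sample}_2_Chimeric.out.junction"))
    return lists


def write_lists(lists: Dict[str, List[str]], outdir: str = ".") -> None:
    """Write each list to a file named after its key, one path per line."""
    for name, paths in lists.items():
        Path(outdir, name).write_text("\n".join(paths) + "\n")
```

With 78 samples, all three files would have 78 lines; the samplesheet holds only the paired-mapping junction files, not the _1 and _2 files.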
There is no -B flag ;) check their documentation here: https://github.com/dieterich-lab/DCC#runnning-dcc
Barry
Hi Barry,
Thanks for your time!
It is so clear, I will try it and let you know what happens. Thanks again!
Best, Birong
Hi all,
Thanks so much for generating this useful pipeline! I wanted to find circrnas in a different way, and I found your work. But when I use it, I encounter the following problems:
Here is my code:
My fastq.gz data: Here I also have a question, is this pipeline only for fastq.gz data? Can I use fastq data?
My error:
Could you please take a look at this? Any advice would be appreciated. Thanks!
Kind regards, Birong