nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License

[nf-core/circrna] error: Your FASTQ files do not have the appropriate extension #42

Closed BirongZhang closed 1 year ago

BirongZhang commented 2 years ago

Hi all,

Thanks so much for generating this useful pipeline! I wanted to find circRNAs in a different way and came across your work. However, when I use it, I run into the following problem:

Here is my code:

nohup nextflow run nf-core/circrna \
-r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
-profile singularity \
--genome 'GRCh37' \
--input "/scratch/c.c2050857/circrna/raw_data_gz/*.fastq.gz" \
--input_type 'fastq' \
--module 'circrna_discovery' \
--tool 'ciriquant, dcc, find_circ, circexplorer2' \
--outdir Results 

My fastq.gz data are shown below. I also have a question: is this pipeline only for fastq.gz data, or can I also use plain fastq files?

Screenshot 2021-10-27 at 13 45 23

My error:

Screenshot 2021-10-27 at 13 45 43

Could you please take a look at this? Any advice would be appreciated. Thanks!

Kind regards, Birong

BarryDigby commented 2 years ago

Hey there, unfortunately the workflow has been designed to work only with paired-end FASTQ files.

BirongZhang commented 2 years ago

Hi Barry, thanks so much for letting me know that.

BarryDigby commented 2 years ago

I'm looking at the experiment metadata for SRR6343608

It states the data is PAIRED. Maybe you did not download the data correctly.

Might I suggest using nf-core/fetchngs, which is an excellent workflow for downloading public datasets? Simply provide the workflow with an SRA / ENA / GEO ID, so for you it would be PRJNA420975 for the first sample. You will have to go digging for the rest.

Good luck!

BirongZhang commented 2 years ago

Hey Barry,

Thanks for your kind help! Let me check my data. I have only downloaded a few samples of this dataset and would like to try them out first. However, I still have a lot of single-end FASTQ data...

Thanks again!

Kind regards, Birong

BirongZhang commented 2 years ago

Hi Barry,

Thanks for letting me know about the useful pipeline nf-core/fetchngs! It works well!

You are right, it is a paired-end dataset (SRR6343628). I don't know why my previous method didn't work.

nohup parallel -j 1 fastq-dump --skip-technical -F ::: $(cat SraAccList.txt)

Last week I tried the same data set: PRJNA420975 with nf-core/fetchngs pipeline, but I was a little confused. Perhaps this data (SRX3441728) is large and the pipeline splits the data into two parts. So when I want to merge them, should I merge them as shown below?

Screenshot 2021-11-01 at 12 53 39

Thanks again for your time and work!

Kind regards, Birong

BarryDigby commented 2 years ago

Hey Birong,

Yep, that didn't work because you need to include the --split-3 option in your fastq-dump command. This will split the mate pairs into *_1.fastq and *_2.fastq files for you. But I see you got nf-core/fetchngs working :)
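
For reference, a minimal sketch of the earlier download loop with --split-3 added. It keeps the flags from the original command and reads accessions from a list file such as SraAccList.txt. The `dump_paired` helper and its `DRY_RUN` switch are illustrative names, not part of the SRA Toolkit.

```shell
#!/usr/bin/env bash
# dump_paired LIST: one fastq-dump call per accession listed in LIST.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
dump_paired() {
  while read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue            # skip empty lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "fastq-dump --split-3 --skip-technical -F $acc"
    else
      # --split-3 writes mate pairs to <acc>_1.fastq and <acc>_2.fastq
      fastq-dump --split-3 --skip-technical -F "$acc"
    fi
  done < "$1"
}
```

Calling `dump_paired SraAccList.txt` prints the commands for review; `DRY_RUN=0 dump_paired SraAccList.txt` would run them.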

Your merge strategy looks correct to me. Judging by the file sizes, they might have split SRX3441728 over two lanes to increase sequencing depth, in which case merging makes sense.

However, just to be safe, run FastQC on the SRX3441728 samples to make sure one of the lanes wasn't a bad batch.

Also, if you merge the files, check to make sure that the *_1.fastq and *_2.fastq mate files have the same number of reads. (I am pretty sure I have come across this error with the aligners in this workflow).
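
The read-count check can be sketched as follows, assuming standard four-line FASTQ records and files that are either gzipped or plain text. The function names `count_reads` and `check_mates` are made up for this example.

```shell
#!/usr/bin/env bash
# count_reads FILE: number of reads in a FASTQ file (4 lines per read).
count_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# check_mates R1 R2: report both counts and succeed only if they match.
check_mates() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "$1: $n1 reads; $2: $n2 reads"
  [ "$n1" = "$n2" ]
}
```

For example, `check_mates SRX3441728_1.fastq.gz SRX3441728_2.fastq.gz || echo "mate files differ"` would flag a mismatch after merging (the file names here are placeholders).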

Best, Barry

BirongZhang commented 2 years ago

Hi Barry,

Sorry, it is me, again!

Thanks for your reply! They are all working now. I've successfully downloaded several datasets!

But I have a new problem with the circrna pipeline.

I am using the supercomputer Hawk, and paired data. Here is my script:

module load nextflow/21.04.0
module load singularity

nextflow run nf-core/circrna \
-r 892b3136e7432221bd81f8c7cc0400ebe541b08e \
-profile singularity \
--genome 'GRCh37' \
--input "raw_data/SRR6343628_{1,2}.fastq.gz" \
--input_type 'fastq' \
--module 'circrna_discovery' \
--tool 'circexplorer2' 

Here is my error:

Screenshot 2021-11-09 at 19 06 18

Error executing process > 'STAR_1PASS (SRR6343628)'
Caused by: Process requirement exceed available CPUs -- req: 16; avail: 8

What does this mean? Does it mean the supercomputer does not meet the pipeline's requirements? When I run the test data, I successfully get the result (test_outdir). What should I do?

Let me know if you need any further information. Thanks so much for your time and patience!

Best regards, Birong

BarryDigby commented 2 years ago

Hi Birong,

Don't worry about it - happy to help.

So this means that the process STAR_1PASS requested 16 CPUs, but you only have 8 CPUs available on the queue you sent the job to on Hawk. You will need to change the configuration file settings. Try the following:

  1. Make a fork of the repository.
  2. Clone the forked repository to your computer.
  3. Make changes to the conf/base.config file
  4. git add . -> git commit -m "config change for hawk" -> git push
  5. Now your saved changes to the config file exist on your forked repo. When running Nextflow, be sure to pull your forked repo and not my original circrna repo, i.e. nextflow pull BirongZhang/circrna, then nextflow run BirongZhang/circrna -r dev [...]

Here is what I mean by point 3:

/*
 * -------------------------------------------------
 *  nf-core/circrna Nextflow base config file
 * -------------------------------------------------
 * A 'blank slate' config file, appropriate for general
 * use on most high performance compute environments.
 * Assumes that all software is installed and available
 * on the PATH. Runs in `local` mode - all jobs will be
 * run on the logged in environment.
 */

process {

  cpus = { check_max( 1 * task.attempt, 'cpus' ) }
  memory = { check_max( 7.GB * task.attempt, 'memory' ) }
  time = { check_max( 12.h * task.attempt, 'time' ) }

  errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
  maxRetries = 1
  maxErrors = '-1'

  // Process-specific resource requirements
  // NOTE - Only one of the labels below are used in the fastqc process in the main script.
  //        If possible, it would be nice to keep the same label naming convention when
  //        adding in your processes.
  // TODO nf-core: Customise requirements for specific processes.
  // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors

  withLabel:process_low {
    cpus = { check_max( 2 * task.attempt, 'cpus' ) }     // This line controls CPU usage for process_low labels
    memory = { check_max( 14.GB * task.attempt, 'memory' ) }
    time = { check_max( 6.h * task.attempt, 'time' ) }
  }
  withLabel:process_medium {
    cpus = { check_max( 8 * task.attempt, 'cpus' ) }    // If your CPU max is 8, set this to 4?
    memory = { check_max( 42.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
  }
  withLabel:process_high {
    cpus = { check_max( 16 * task.attempt, 'cpus' ) }   // Alignment steps use process_high and request 16 CPUs. Change this to 8 CPUs
    memory = { check_max( 84.GB * task.attempt, 'memory' ) }
    time = { check_max( 16.h * task.attempt, 'time' ) }
  }
  withLabel:process_long {
    time = { check_max( 24.h * task.attempt, 'time' ) }
  }
  withName:get_software_versions {
    cache = false
  }
  withLabel:py3{
    container = 'barryd237/py3:dev'
  }
}

You could ask your system administrator about the maximum CPU and memory capacity of Hawk so you can configure this file in such a way that it never asks for more resources than are available.
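
An alternative to editing a fork, sketched here under assumptions: Nextflow accepts an extra configuration file through its -c option, and settings in that file take precedence over the pipeline's own config. The file name hawk.config and the CPU values are placeholders for an 8-CPU queue; the label names are the ones from the base config above.

```shell
#!/usr/bin/env bash
# Write a small override file next to where the pipeline is launched.
# The values are illustrative; adjust them to what the Hawk queue allows.
cat > hawk.config <<'EOF'
process {
  withLabel:process_medium { cpus = 4 }
  withLabel:process_high   { cpus = 8 }
}
EOF

# Then add the file to the usual command, for example:
#   nextflow run nf-core/circrna -r <revision> -profile singularity -c hawk.config [...]
```

nf-core pipelines built from the same template commonly also accept a --max_cpus parameter that caps what check_max returns. Whether this revision honours it is an assumption worth checking with the pipeline's --help output.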

BarryDigby commented 2 years ago

@BirongZhang Going to re-open this issue because it has a lot of good troubleshooting questions in it - if that's ok?

BirongZhang commented 2 years ago

Hi Barry,

Thanks so much for your kind help!

I will try what you suggested and ask the Hawk team about the maximum CPU count.

I will let you know what happens. Thanks again.

Best, Birong

BirongZhang commented 2 years ago

Hi Barry,

I am back. I can use another highmem partition, so maybe the previous problem can be solved. But this time, a new problem emerged before the STAR step:

N E X T F L O W  ~  version 21.04.0
Launching `nf-core/circrna` [prickly_gautier] - revision: 892b3136e7432221bd81f8c7cc0400ebe541b08e
WARNING: Could not load nf-core/config profiles: https://raw.githubusercontent.com/nf-core/configs/master/nfcore_custom.config
WARN: There's no process matching config selector: get_software_versions

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/circrna v1.0.0
------------------------------------------------------

Input data log info:
No input sample CSV file provided, attempting to read from path instead.
Reading input data from path: raw_data/SRR6343609_{1,2}.fastq.gz

Core Nextflow options
  runName               : prickly_gautier
  containerEngine       : singularity
  container             : barryd237/circrna:dev
  launchDir             : /scratch/c.c2050857/NAFLD/GSE107650
  workDir               : /scratch/c.c2050857/NAFLD/GSE107650/work
  projectDir            : /home/c.c2050857/.nextflow/assets/nf-core/circrna
  userName              : c.c2050857
  profile               : singularity
  configFiles           : /home/c.c2050857/.nextflow/assets/nf-core/circrna/nextflow.config

Input/output options
  input                 : raw_data/SRR6343609_{1,2}.fastq.gz
  input_type            : fastq
  outdir                : GSE107650_Results

Reference genome files
  genome                : GRCh37

STAR
  chimScoreSeparation   : 10

Generic options
  max_multiqc_email_size: 25 MB

Max job request options
  max_memory            : 128 GB
  max_time              : 10d

------------------------------------------------------
 Only displaying parameters that differ from defaults.
------------------------------------------------------
WARN: Access to undefined parameter `name` -- Initialise it to a default value eg. `params.name = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
WARN: Access to undefined parameter `gtf` -- Initialise it to a default value eg. `params.gtf = some_value`
WARN: Access to undefined parameter `bowtie` -- Initialise it to a default value eg. `params.bowtie = some_value`
WARN: Access to undefined parameter `bowtie2` -- Initialise it to a default value eg. `params.bowtie2 = some_value`
WARN: Access to undefined parameter `bwa` -- Initialise it to a default value eg. `params.bwa = some_value`
WARN: Access to undefined parameter `fasta_fai` -- Initialise it to a default value eg. `params.fasta_fai = some_value`
WARN: Access to undefined parameter `hisat` -- Initialise it to a default value eg. `params.hisat = some_value`
WARN: Access to undefined parameter `star` -- Initialise it to a default value eg. `params.star = some_value`
WARN: Access to undefined parameter `segemehl` -- Initialise it to a default value eg. `params.segemehl = some_value`
WARN: Access to undefined parameter `circexplorer2_annotation` -- Initialise it to a default value eg. `params.circexplorer2_annotation = some_value`

executor >  local (2)
[f6/4d3ce9] process > SOFTWARE_VERSIONS       [100%] 1 of 1 ✔
[-        ] process > BWA_INDEX               -
[-        ] process > SAMTOOLS_INDEX          -
[-        ] process > HISAT2_INDEX            -
[-        ] process > STAR_INDEX              -
[-        ] process > BOWTIE_INDEX            -
[-        ] process > BOWTIE2_INDEX           -
[-        ] process > SEGEMEHL_INDEX          -
[-        ] process > FILTER_GTF              -
[-        ] process > CIRIQUANT_YML           -
[-        ] process > GENE_ANNOTATION         -
[-        ] process > BAM_TO_FASTQ            -
[e3/10de03] process > FASTQC_RAW (SRR6343609) [  0%] 0 of 1
[-        ] process > BBDUK                   -
[-        ] process > FASTQC_BBDUK            -
[-        ] process > CIRIQUANT               -
[-        ] process > STAR_1PASS              -
[-        ] process > SJDB_FILE               -
[-        ] process > STAR_2PASS              -
[-        ] process > CIRCEXPLORER2           -
[-        ] process > CIRCRNA_FINDER          -
[-        ] process > DCC_MATE1               -
[-        ] process > DCC_MATE2               -
[-        ] process > DCC                     -
[-        ] process > FIND_ANCHORS            -
[-        ] process > FIND_CIRC               -
[-        ] process > MAPSPLICE_ALIGN         -
[-        ] process > MAPSPLICE_PARSE         -
[-        ] process > SEGEMEHL_ALIGN          -
[-        ] process > ANNOTATION              -
[-        ] process > FASTA                   -
[-        ] process > COUNT_MATRIX_SINGLE     -
[-        ] process > TARGETSCAN_DATABASE     -
[-        ] process > MIRNA_PREDICTION        -
[-        ] process > MIRNA_TARGETS           -
[-        ] process > HISAT_ALIGN             -
[-        ] process > STRINGTIE               -
[-        ] process > DEA                     -
[-        ] process > MULTIQC                 -
Error executing process > 'STAR_1PASS (null)'

Caused by:
  Connect to ngi-igenomes.s3.amazonaws.com:443 [ngi-igenomes.s3.amazonaws.com/52.218.112.154] failed: Network is unreachable (connect failed)

-[nf-core/circrna] Pipeline completed with errors-
WARN: Killing pending tasks (1)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

Have you come across this before? Let me know if you need more details, thanks.

Best, Birong

BarryDigby commented 2 years ago

Hey Birong,

It looks like you do not have an internet connection on the cluster. Try pinging Google from the cluster; the result should look like this:

barry@YT-1300:/data$ ping www.google.com
PING www.google.com(di-in-f106.1e100.net (2a00:1450:400b:c01::6a)) 56 data bytes
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=1 ttl=110 time=55.6 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=2 ttl=110 time=132 ms
64 bytes from di-in-f106.1e100.net (2a00:1450:400b:c01::6a): icmp_seq=3 ttl=110 time=43.7 ms
^C
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 43.667/77.093/132.042/39.157 ms

BarryDigby commented 2 years ago

If you can locate the reference genome files you need (GRCh37 FASTA, GTF files [previous runs on your laptop maybe?]) and upload them to the cluster manually, you will not need to connect to the AWS iGenomes bucket to automatically pull reference files.

Then I can look into running the pipeline 'offline' for you. I've never done it but can try to learn.

BirongZhang commented 2 years ago

Hi Barry,

I am back again! So sorry for the delay. I had a break.

Yes, you are right. The supercomputer team also told me that sometimes I was not allowed to download some external data because of the firewall. This also reminds me that sometimes I cannot even use wget in some supercomputer partitions.

I really appreciate your offer of "offline" help, but I don't think I should take up any more of your time and energy because of my particular case. You have done enough for me, and I really learned a lot from our conversation.

No worries: while trying to use your pipeline, I generated some STAR junction files, so next I will try the circRNA tools one by one.

Nice to meet you online! Thanks so much for your kind help all the time!

Best, Birong

BirongZhang commented 2 years ago

Hi Barry,

I am back again! I saw you also used DCC. When I was using DCC, I got an error. Could you take a quick look?

Here are my scripts:

# http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
sed -i  '' 's/^chr//g' GRCh38_repeat_file.gtf
head -3 GRCh38_repeat_file.gtf

# Preparation of input files for circRNA detection step
# step one: obtain reference genome: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/

# step two: repeat masker file for the genome build: GRCh38_repeatmasker.gtf.gz
http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1221177343_h3fzKaDey7mY9G3uYurJpQBBXJ1S&clade=mammal&org=Human&db=hg38&hgta_group=rep&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C560%2C138-15%2C602%2C945&hgta_outputType=primaryTable&hgta_outFileName=UCSC
https://www.biostars.org/p/227979/

DCC samplesheet \
      -mt1 meta1 \
      -mt2 meta2 \
      -D \
      -R GRCh38_repeat_file.gtf \
      -an GRCh38_repeatmasker.gtf \
      -Pi \
      -F \
      -M \
      -Nr 5 6 \
      -fg \
      -G \
      -O DCC \
      -A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Here is my STAR output:

find -L Results/data/sample -name "*_Chimeric.out.junction" > samplesheet
find -L Results/data/sample_1 -name "*_1_Chimeric.out.junction" > meta1
find -L Results/data/sample_2 -name "*_2_Chimeric.out.junction" > meta2

head -3 samplesheet
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample/SRR9036334/SRR9036334_2_Chimeric.out.junction

$ head -3 meta1
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036347/SRR9036347_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036334/SRR9036334_1_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_1/SRR9036315/SRR9036315_1_Chimeric.out.junction

$ head -3 meta2
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036347/SRR9036347_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036334/SRR9036334_2_Chimeric.out.junction
/scratch/c.c2050857/NAFLD/GSE130970/Results/data/sample_2/SRR9036315/SRR9036315_2_Chimeric.out.junction

Screenshot 2021-12-01 at 00 54 12

Here is the output:

DCC 0.5.0 started
44 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
WARNING: File meta2, line 1 does not contain all features.
WARNING: meta2 is probably corrupt.
WARNING: Offending line: /scratch/c.c2050857/NAFLD/GSE130970/Results/R2/SRR9036381_2_Chimeric.out.junction
Traceback (most recent call last):
  File "/nfshome/store03/users/c.c2050857/.venv-circtools-detect/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.5.0', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 254, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 535, in fixall
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 94, in fixchimerics
  File "build/bdist.linux-x86_64/egg/DCC/fix2chimera.py", line 65, in fixmate2
IndexError: list index out of range

Is there anything wrong with my scripts or the input? Thanks!

Best regards, Birong

BarryDigby commented 2 years ago

Hey Birong,

So, one or two things that might help (though it is hard to tell from the output):

Here is an example of a Chimeric.out.junction file I have on my computer:

chr1    1767875 -   chr1    9242065 +   2   1   4   simulate:21663  1767876 76S24M  9242066 24S76M1595p73M18828N27M
chr1    9243810 +   chr1    11026892    +   0   0   0   simulate:21971  9242109 100M1519p82M18S 11026893    82S16M2S
chr1    15450870    +   chr1    15447467    +   2   1   2   simulate:33533  15450824    46M54S  15447468    46S54M3206p100M
chr1    15450870    +   chr1    15447467    +   2   1   2   simulate:33535  15447538    18M3168N82M48p16M84S    15447468    16S82M2S
chr1    15450870    +   chr1    15447467    +   2   1   2   simulate:33540  15450832    38M62S  15447468    38S62M10p16M3168N84M
chr1    15450870    +   chr1    15447467    +   2   1   2   simulate:33547  15447485    71M3168N27M2S69p50M50S  15447468    50S50M
chr1    15447467    -   chr1    15450870    -   1   2   1   simulate:33549  15447468    59S39M2S    15450749    100M-38p59M41S
chr1    15450855    +   chr1    15447532    +   -1  0   0   simulate:33558  15450755    100M    15447533    23M3168N77M
chr1    15450870    +   chr1    15447467    +   2   1   2   simulate:33559  15450838    32M68S  15447468    32S68M-14p34M3168N66M
chr1    15447475    -   chr1    15450754    -   -1  0   0   simulate:33560  15447476    80M3168N20M 15447486    70M3168N30M

(14 columns).
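
In the log above, the "offending line" is a file path, which suggests DCC parsed the list file meta2 as junction data rather than as a list of files. A quick sanity check, sketched here with a hypothetical helper named `check_junction_list`, is to confirm that every path in a list file exists and to print the column count of its first line. Some STAR versions may write extra columns or comment lines, so treat 14 as what the example above shows rather than a fixed rule.

```shell
#!/usr/bin/env bash
# check_junction_list LIST: for each path in LIST, print "<columns> <path>"
# for the first line of that file, or "MISSING <path>" if it does not exist.
check_junction_list() {
  while read -r f || [ -n "$f" ]; do
    [ -z "$f" ] && continue              # skip empty lines
    if [ ! -f "$f" ]; then
      echo "MISSING $f"
      continue
    fi
    cols=$(awk -F '\t' 'NR == 1 { print NF; exit }' "$f")
    echo "${cols:-0} $f"                 # 0 means the file is empty
  done < "$1"
}
```

Running `check_junction_list meta2` on the list file from the failing command would show whether each listed junction file is present and tab-separated as expected.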

Good luck,

Barry

BirongZhang commented 2 years ago

Hi Barry,

Thanks so much for your kind reply! It helps a lot!🥳

Do you mean this? How about -R GRCh38_repeat_file.gtf and -B bam_file.txt ? Do you have any suggestions about them?

DCC @samplesheet \
      -mt1 @meta1 \
      -mt2 @meta2 \
      -D \
      -R GRCh38_repeat_file.gtf \
      -an /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.103.gtf \
      -Pi \
      -F \
      -M \
      -Nr 5 6 \
      -fg \
      -G \
      -O DCC \
      -A /scratch/c.c2050857/reference/reference_Human/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Before that, I put all the STAR output into one big directory. Today, I tried putting each SRR STAR output file into its own SRR directory. So now samplesheet has 158 lines, and meta1 and meta2 have 78 lines each. Is that okay? I am really confused about how to prepare those files.😣

sample => samplesheet (_1 and _2, 158 lines)

Screenshot 2021-12-01 at 18 03 17

sample_1 => meta1 (only _1, 78 lines); sample_2 => meta2 (only _2, 78 lines)

Screenshot 2021-12-01 at 18 04 57

Let me try it first, thanks again!🤗

Kind regards, Birong

BarryDigby commented 2 years ago

The way I designed DCC in my workflow is to use the outputs from STAR using the 2nd pass mode.

  1. Map both reads to genome using STAR (1st pass).
  2. Collect all sj.out.tab files for every sample mapped in 1st pass. (these are novel junction sites)
  3. Perform STAR 2nd pass mapping, where I include the sj.out.tab files to help STAR align to novel splice sites. This is done for A: paired end reads, and B: each read individually
  4. Collect the Chimeric.out.junction files. Using SRR9036307 as an example, DCC expects SRR9036307_Chimeric.out.junction, SRR9036307_1_Chimeric.out.junction and SRR9036307_2_Chimeric.out.junction as inputs.

In the workflow, for sample SRR9036307, there are 3 inputs:

SRR9036307/SRR9036307_Chimeric.out.junction
mate1/SRR9036307_1_Chimeric.out.junction
mate2/SRR9036307_2_Chimeric.out.junction

The printf command is simply placing these $PATHS in samplesheet, mate1 and mate2 files for DCC - nothing special.
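
A minimal sketch of that printf step, assuming the directory layout listed above sits under a single root directory. The function name `build_dcc_lists` and the root argument are hypothetical. Under this layout each list file gets exactly one line per sample, in the same sample order.

```shell
#!/usr/bin/env bash
# build_dcc_lists ROOT ID...: write samplesheet, mate1 and mate2 in the
# current directory, one path per sample, keeping the samples in one order.
build_dcc_lists() {
  root="$1"; shift
  : > samplesheet; : > mate1; : > mate2   # start from empty files
  for id in "$@"; do
    printf '%s\n' "$root/$id/${id}_Chimeric.out.junction"      >> samplesheet
    printf '%s\n' "$root/mate1/${id}_1_Chimeric.out.junction"  >> mate1
    printf '%s\n' "$root/mate2/${id}_2_Chimeric.out.junction"  >> mate2
  done
}
```

For example, `build_dcc_lists /path/to/star_2pass SRR9036307 SRR9036347` writes two lines to each of the three files; the root path is a placeholder.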

There is no -B flag ;) check their documentation here: https://github.com/dieterich-lab/DCC#runnning-dcc

Barry

BirongZhang commented 2 years ago

Hi Barry,

Thanks for your time!

It is so clear, I will try it and let you know what happens. Thanks again!

Best, Birong