junchaoshi / sports1.1

Small non-coding RNA annotation Pipeline Optimized for rRNA- and tRNA-Derived Small RNAs
GNU General Public License v3.0

match genome 0 #21

Closed ANDdna1991 closed 3 years ago

ANDdna1991 commented 3 years ago

Dear Shi,

I would like to use SPORTS1.1 to analyse a small RNA-seq dataset, but I'm running into some issues. I installed it following the installation recipe and all the programs seem to be working. However, when I run SPORTS1.1 I get:

Class                       Sub_Class  Reads
Clean_Reads                 -          3321509
Match_Genome                -          0
Unannotated_Match_Genome    -          0
Unannotated_Unmatch_Genome  -          3321509

So bowtie is not mapping any reads. I have tried bowtie2 with the same datasets and the reads map properly. As far as I can see, the issue is not with SPORTS1.1 but with bowtie. I'm using your pre-built database.

In the processing report file I can see the message "Error reading _rstarts[] array: 7376, 14208", but I can't find this error on Google.

match to genome
Error reading _rstarts[] array: 7376, 14208
Command: bowtie-align --wrapper basic-0 -f -v 0 -k 1 -p 8 --al /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_match_genome.fa --un /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_unmatch_genome.fa /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/genome/hg38/genome /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1.fa
rm: cannot remove ‘/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_processed/S463_A1_R1_val_1_output_tRNA’: No such file or directory
rm: cannot remove ‘/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/1_S463_A1_R1_val_1/S463_A1_R1_val_1_processed/S463_A1_R1_val_1_tRNA_mapping.txt’: No such file or directory

Any input is more than welcome! Thanks, Andres

ANDdna1991 commented 3 years ago

Hi again,

I think I solved this: I re-built the bowtie index and the error has gone, so it was probably an incompatibility between bowtie versions. I have been running SPORTS1.1 for a while and, so far, it has taken 1.13 h to map a single sample. Is this normal? I haven't received any error, but the program doesn't seem to be doing anything at all, at least judging by the output files. Is that fine?

Searching input files...

/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A3/S463_A3_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A3/S463_A3_R2_val_2.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B3/S463_B3_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B3/S463_B3_R2_val_2.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B2/S463_B2_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B2/S463_B2_R2_val_2.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A1/S463_A1_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A1/S463_A1_R2_val_2.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B1/S463_B1_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_B1/S463_B1_R2_val_2.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A2/S463_A2_R1_val_1.fq
/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A2/S463_A2_R2_val_2.fq

Processing input files...

/rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/fastq_trimmed/S463_A1/S463_A1_R1_val_1.fq

Thanks in advance, Andres
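The re-built index mentioned in this comment can be sanity-checked before launching a long run. The sketch below only verifies that the six files of a standard bowtie 1 index exist and are non-empty for a given index base name; it cannot detect a version incompatibility, and the function name `missing_index_files` is made up for illustration. Large indexes use the `.ebwtl` extension instead of `.ebwt`, which the `large` flag accounts for.

```python
from pathlib import Path

# File suffixes of a standard (small) bowtie 1 index.
EBWT_SUFFIXES = (
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
)


def missing_index_files(index_base, large=False):
    """Return names of index files that are absent or empty for index_base.

    index_base is the path prefix passed to bowtie (e.g. ".../hg38/genome").
    This is a hypothetical helper, not part of SPORTS1.1 or bowtie.
    """
    suffixes = [s + "l" if large else s for s in EBWT_SUFFIXES]
    missing = []
    for suffix in suffixes:
        path = Path(str(index_base) + suffix)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing
```

If the list is not empty, or if bowtie still reports a read error on the index, rebuilding it with the installed bowtie version (`bowtie-build <reference.fa> <index_base>`) is the fix that worked in this thread.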

junchaoshi commented 3 years ago

Hi Andres,

Could you provide the parameters you used? You can find the intermediate files in your output directory while the pipeline is running.

For now, SPORTS1 does not support bowtie2; you should use bowtie1 for the mapping. Mapping speed depends on the number of reference databases used. Using the multithreading parameter -p and reducing the number of reference databases (especially the piRNA database) can speed up the run.

Hope the information helps, Junchao

ANDdna1991 commented 3 years ago

Hi Junchao,

Thank you so much for the reply. I'm using pretty standard parameters:

sports.pl -i rep_1_R1.fq.gz fastq_trimmed -p 8 -g /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/genome/hg38/genome -m /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/miRBase/21/miRBase_21-hsa -r /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/rRNAdb/human_rRNA -t /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/GtRNAdb/hg19/hg19-tRNAs -w /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/piRBase/piR_human -e /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/Ensembl/release-89/Homo_sapiens.GRCh38.ncrna -f /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/Rfam/12.3/Rfam-12.3-human -o /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/ -k

However, I get the same result if I just submit

sports.pl -i rep_1_R1.fq.gz fastq_trimmed -p 4 -g /home/ah2192/rds/rds-mrc_tox-XUr6B1Jhndg/ah2192_backup/annotations/smallRNA/Homo_sapiens/genome/hg38/genome -o /rds/project/rds-XUr6B1Jhndg/ah2192_backup/projects/small_rna_seq/data/small_RNAseq_NHS_HS/SPORTS1.1_output/ -k

It starts processing the first sample, but it doesn't move on and gets stuck on it. Here you can see the output files that have been generated:

0 Jul 5 17:07 S463_A1_R1_val_1_discarded_reads.fa
4096 Jul 5 15:41 S463_A1_R1_val_1_fa
18434804 Jul 5 17:07 S463_A1_R1_val_1.fa
6782785 Jul 5 17:07 S463_A1_R1_val_1_match_genome.fa
13967360 Jul 5 17:07 S463_A1_R1_val_1_output_match_genome
4096 Jul 5 15:41 S463_A1_R1_val_1_processed
4096 Jul 5 15:41 S463_A1_R1_val_1_result
191410015 Jul 5 17:07 S463_A1_R1_val_1_too_long_reads.fa
0 Jul 5 17:07 S463_A1_R1_val_1_too_short_reads.fa
11645049 Jul 5 17:07 S463_A1_R1_val_1_unmatch_genome.fa

I don't know what the problem could be. I have been trying to get this to run for a while, without success so far. I would really appreciate any input.

Thanks a lot! Andres

junchaoshi commented 3 years ago

Hi Andres,

"-i rep_1_R1.fq.gz fastq_trimmed" is not a valid input format. Compressed files need to be unpacked before input.

The valid input should be:

Input: -i Input could be: a directory (will run all qualified files in the directory recursively); a .txt file (for batch processing data, which should contain absolute path of input files or directories); a .fastq/.fq or .fasta/.fa file.

Please read the input examples in the manual.

Best, Junchao
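The input rules quoted in this comment can be written down as a small pre-flight check. This is a minimal sketch based only on the wording above (a directory, a .txt batch list, or a .fastq/.fq/.fasta/.fa file); the function name `classify_input` and its return labels are hypothetical, and SPORTS1.1 itself may apply additional checks.

```python
import os

# Sequence file extensions named in the manual excerpt above.
SEQ_EXTENSIONS = (".fastq", ".fq", ".fasta", ".fa")


def classify_input(path):
    """Classify a value intended for the -i option (illustrative helper)."""
    if os.path.isdir(path):
        return "directory"
    if path.endswith(".txt"):
        return "batch list"
    if path.endswith(SEQ_EXTENSIONS):
        return "sequence file"
    # Anything else, including compressed files such as .fq.gz,
    # does not match the documented input forms.
    return "unsupported"
```

Under these assumptions, `classify_input("rep_1_R1.fq.gz")` returns `"unsupported"`, which matches the advice to unpack compressed files first.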

ANDdna1991 commented 3 years ago

Yes, sorry, that was a mistake in my message. The input was a folder with the structure you describe in the manual:

-i fastq_trimmed

Thanks, Andres

junchaoshi commented 3 years ago

Based on the information "191410015 Jul 5 17:07 S463_A1_R1_val_1_too_long_reads.fa", it seems you didn't trim the adaptor seq.
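A large too_long_reads file suggests that most reads exceed the length limit, which is easy to confirm from the trimmed FASTQ itself. The sketch below counts read lengths and reports the fraction above a threshold; the function names are illustrative, and the 45 nt cut-off used in the usage note is simply the value mentioned later in this thread, not a confirmed SPORTS1.1 default.

```python
from collections import Counter


def read_length_counts(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


def fraction_longer_than(counts, max_len):
    """Return the fraction of reads strictly longer than max_len."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    too_long = sum(n for length, n in counts.items() if length > max_len)
    return too_long / total
```

For a real file, pass an open handle, for example `with open("S463_A1_R1_val_1.fq") as fh: counts = read_length_counts(fh)`, then call `fraction_longer_than(counts, 45)`. A value close to 1 would point to untrimmed adaptors.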

ANDdna1991 commented 3 years ago

Hi, thank you very much for your help.

I used Trim Galore to remove the adapters and I can't see them in the FastQC report (please see the picture). I have now also removed the reads longer than 45 bp (with Cutadapt), but SPORTS1.1 gets stuck at the alignment step.

0 Jul 6 07:26 S463_A1_R1_cutadapt_discarded_reads.fa
18434804 Jul 6 07:26 S463_A1_R1_cutadapt.fa
0 Jul 6 07:26 S463_A1_R1_cutadapt_match_genome.fa
0 Jul 6 07:26 S463_A1_R1_cutadapt_output_match_genome
0 Jul 6 07:26 S463_A1_R1_cutadapt_too_long_reads.fa
0 Jul 6 07:26 S463_A1_R1_cutadapt_too_short_reads.fa
0 Jul 6 07:26 S463_A1_R1_cutadapt_unmatch_genome.fa

[Image: Screenshot 2021-07-06 at 08 34 53]

I don't understand what the problem is. I don't receive any error. I left this running all night (40 cores) and it didn't progress any further. Could you send me a small tested FASTA file so I can check whether the problem is in the sample or in the configuration?

Thanks, Andres

junchaoshi commented 3 years ago

Could you upload the first few records of your .fastq file? I have also uploaded a small test.fq file whose reads do not need adaptor trimming.

test.zip

best, Junchao

junchaoshi commented 3 years ago

Please also check the intermediate processing report ([output_address]/processing_report/[input_file].txt) to see which step you are stuck in.

Best, Junchao
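Reading the processing report can be reduced to looking at its last non-empty line. The sketch below assumes, based only on the reports pasted in this thread, that the report is plain text with a timestamp followed by one line per step as it starts; the function name `last_reported_step` is hypothetical.

```python
def last_reported_step(report_text):
    """Return the last non-empty line of a processing report, or None if empty."""
    lines = [line.strip() for line in report_text.splitlines() if line.strip()]
    return lines[-1] if lines else None
```

If the returned line stays at "match to genome" for a long time, the run has not moved past the genome alignment step.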

ANDdna1991 commented 3 years ago

Hi Junchao, thank you very much for your support.

Here are some reads from one of the samples; I can see they are repetitive.

@A00931:296:HC3GTDRXY:1:2124:20582:27211 1:N:0:CGATGT
TCTCAGTGATGAAAACTTTGTAAAAAAAAAA
+
FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
@A00931:296:HC3GTDRXY:1:2147:11514:20964 1:N:0:CGATGT
AAGCGGCTGTGCAGACATTCAATTGTTAAAAAAAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00931:296:HC3GTDRXY:1:2144:3287:35618 1:N:0:CGATGT
CGGCTGTGCAGACATTCAATTGTTAAAAAAAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00931:296:HC3GTDRXY:1:2234:12156:3505 1:N:0:CGATGT
CGGCTGTGCAGACATTCAATTGTTAAAAAAAAAA
+
FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF
@A00931:296:HC3GTDRXY:1:2265:26160:36777 1:N:0:CGATGT
CGGCTGTGCAGACATTCAATTGTTAAAAAAAAAAA
+

About the processing_report:

cat 1_S463_A1_R1_cutadapt.txt
Tue 6 Jul 07:26:06 BST 2021

match to genome

Thanks a lot, Andres

ANDdna1991 commented 3 years ago

The sample is probably wrong, but I want to be sure of my conclusion. I didn't generate these samples. I'm going to check your test.fq; where can I find it?

Thanks, Andres

junchaoshi commented 3 years ago

It seems a polyA sequence was used as the adaptor, and it needs to be removed.

Best, Junchao
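To illustrate what removing the polyA means for the reads posted above, here is a minimal sketch that strips a trailing run of A's from a read and its quality string. The function name `trim_poly_a` and the `min_run` threshold of 5 are assumptions chosen for illustration; a dedicated trimmer such as Cutadapt, already used in this thread, is likely the more robust choice for real data.

```python
import re


def trim_poly_a(seq, qual, min_run=5):
    """Remove a trailing run of at least min_run A's from seq and qual.

    seq and qual must be the same length and free of trailing newlines.
    Returns the pair unchanged when no such run is found.
    """
    match = re.search("A{%d,}\\Z" % min_run, seq)
    if match is None:
        return seq, qual
    cut = match.start()
    return seq[:cut], qual[:cut]
```

Applied to the first read in the excerpt, this would leave the 21 nt sequence that precedes the ten trailing A's, while the internal AAAA stretch is untouched because it does not reach the end of the read.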

ANDdna1991 commented 3 years ago

That's odd, because I had already trimmed the Illumina adapter. I'm going to remove the polyA and try again. Where can I find the small test.fq, so I can be sure that the problem comes from my samples?

Thanks a lot! Andres

junchaoshi commented 3 years ago

Click the test.zip link in my earlier comment to download it.

ANDdna1991 commented 3 years ago

Ok, clearly something is not working.

I've used your test.fq but the behaviour is the same.

this is the code:

sports.pl -i -p 8 test.fq -g /Homo_sapiens/genome/hg38/genome -o /rds/project/rds data/small_RNAseq_NHS_HS/SPORTS1.1_output/ -k

The processing file:

cat 1_test.txt
Tue 6 Jul 08:18:25 BST 2021

match to genome

and the output files:

0 Jul 6 08:18 test_discarded_reads.fa
930 Jul 6 08:18 test.fa
0 Jul 6 08:18 test_match_genome.fa
0 Jul 6 08:18 test_output_match_genome
0 Jul 6 08:18 test_too_long_reads.fa
0 Jul 6 08:18 test_too_short_reads.fa

So, clearly I haven't configured this properly. All the programs are installed and run. I guess the problem is bowtie, which I installed from source (ver 1.2). I have no clue what the problem could be.

junchaoshi commented 3 years ago

The bowtie 1 version I am using is 1.2.3. You can test if the bowtie you used is installed properly. You can also use the precompiled bowtie to avoid installing it from the source code.

Best, Junchao
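One way to test the installed bowtie is to ask it for its version and compare that with the 1.2.3 mentioned above. The sketch below calls `bowtie --version` and extracts the first version number that follows the word "version"; the exact output layout is an assumption, and both function names are hypothetical.

```python
import re
import subprocess


def parse_bowtie_version(version_output):
    """Extract a dotted version number following the word 'version', if any."""
    match = re.search(r"version\s+(\d+(?:\.\d+)+)", version_output)
    return match.group(1) if match else None


def installed_bowtie_version(executable="bowtie"):
    """Return the version reported by the given bowtie executable, or None.

    None means the executable was not found on PATH or printed no
    recognisable version string.
    """
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    return parse_bowtie_version(result.stdout)
```

A `None` result, or a version that differs from the one used to build the index, would be a reasonable first thing to rule out before looking further into the pipeline configuration.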