sestaton / transposome-scripts

Utility scripts for working with Transposome
MIT License

repbase superfamilies not being identified. #1

Open asifzubair opened 8 years ago

asifzubair commented 8 years ago

Hi,

I tried using format_database.pl to format the RepBase data. I basically concatenated all the references.

cat RepBase21.01.fasta/*.ref RepBase21.01.fasta/appendix/*.ref > repbase_all.fasta

and then called format_database.pl

format_database.pl -i repbase_20.11_all.fasta -o output.fa -g genus_placeholder -s species_placeholder 2>error.log

However, after the call most of the sequences were rejected with a message that the superfamily is not recognized. The log file is attached: error.log.zip.

Is it that I have to use an older version of the RepBase database because the naming convention is no longer followed?

Please let me know.

Thanks !

a.

sestaton commented 8 years ago

Thanks for the message, I see the problem. This script is designed to take a file of user-generated repeats from one species and format it like RepBase data. The headers are also expected to have the standard superfamily prefixes, such as RLC for Copia, RLG for Gypsy, etc.

Since your data is in RepBase format I think you are good to use it as is. The important part is the ">family superfamily genus species" format (separated by tabs), but that is probably already the case. The header format is important because the results are summarized for different taxonomic levels. My suggestion is to try to use the data as is, and let me know if there are any issues.
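
As a quick sanity check before running anything (a rough sketch, not a Transposome feature), you could scan the concatenated file for headers that don't follow the tab-separated layout. The toy file below stands in for your repbase_all.fasta, and it assumes genus and species share the last tab field; adjust the NF test if your RepBase release differs:

```shell
# Toy input: one conforming header, one not.
printf '>Copia-1\tCopia\tArabidopsis thaliana\nACGT\n>badheader\nACGT\n' > db_demo.fa
# Flag headers that do not have three tab-separated fields.
awk -F'\t' '/^>/ && NF != 3' db_demo.fa
# prints: >badheader
```

Any headers it prints would need fixing before the taxonomic summaries can work.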

asifzubair commented 8 years ago

Thank you for the quick reply. I'll do just that and report back.

asifzubair commented 8 years ago

hi evan,

you were right - repbase data can be used as-is. however, i had a follow-up question and didn't know where to pose it, so asking here:

roughly how much memory does the clustering step need in relation to the number of reads? i am using 100K PE reads and the analysis breaks at clustering even with 100 GB of memory. how much should i allocate to the program? i remember your paper had data for up to 1M reads, so i was wondering if you could give me some pointers there.

i did try setting the in_memory flag to 0, but then the analysis is really slow - it's been 6 hours and the Louvain method still hasn't started.

any help would be really appreciated. thanks !

also, please feel free to close the issue if you like, or please let me know and i'll do it.

sestaton commented 8 years ago

Sorry for not responding sooner, I can help with this issue.

The running time and memory usage have a lot to do with the amount of simple repeats or other contaminants, like organellar DNA, that may be at high coverage. If there is a large proportion of simple repeats in the data then, in my experience, you will see much longer run times and higher memory usage (the same is true for organellar contamination). That said, the conditions you describe are a bit odd. It should take only a few minutes and very little RAM for large, complex genomes like sunflower or maize with just 100k reads (unfiltered).

What I would try first is to remove organellar and rDNA contamination, and to estimate the ratio of simple repeats in your data. Removing rDNA is not usually something I do, but I have seen samples where it caused longer run times when it was in high abundance (being composed of arrays of simple repeats). Also, the assumption is that you have WGS data, so if these are pooled or enriched libraries of any kind then you will have problems, based on what I've gathered from other users.
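
As a sketch of the pair-aware removal step (assuming you have already produced a list of contaminant read IDs, say from a blastn screen against organelle and rDNA references; the contam_demo.ids and pairs_demo.fq files below are stand-ins for your data):

```shell
# Toy interleaved FASTQ with two pairs; pair r1 is flagged as contaminant.
printf 'r1\n' > contam_demo.ids
printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n@r2/2\nACGT\n+\nIIII\n' > pairs_demo.fq
# Drop both mates of any pair whose ID (header minus '@' and the /1 or /2
# suffix) appears in the list; the decision is made once per 8-line pair.
awk 'NR == FNR { bad[$1] = 1; next }
     FNR % 8 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); drop = (id in bad) }
     !drop' contam_demo.ids pairs_demo.fq
# prints only the 8 lines of pair r2
```

Working on whole 8-line records keeps mates together, which matters for an interleaved file.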

Let me know if the filtering helps or if you have other questions. If these steps help then I will add more details to the documentation about pre-processing the data.

asifzubair commented 8 years ago

thank you, evan. I'll try to filter for simple repeats and organellar DNA. would you advise using something like DUST to remove simple repeats?

even with 256 GB of memory, transposome did not finish - the log is attached:

(0)> cat t_results_100K_out/t_log.txt 
======== Transposome version: 0.09.8 (started at: 23-02-2016 15:12:05) ========
Configuration - Log file for monitoring progress and errors: t_log.txt
Configuration - Sequence file:                               sub.interl.100K.fq
Configuration - Sequence format:                             fastq
Configuration - Sequence number for each BLAST process:      10000
Configuration - Number of CPUs per thread:                   2
Configuration - Number of threads:                           4
Configuration - Output directory:                            t_results_100K_out
Configuration - In-memory analysis:                          1
Configuration - Percent identity for matches:                90
Configuration - Fraction coverage for pairwise matches:      0.55
Configuration - Merge threshold for clusters:                100
Configuration - Minimum cluster size for annotation:         100
Configuration - BLAST e-value threshold for annotation:      10
Configuration - Repeat database for annotation:              /home/cmb-07/sn1/asifzuba/transposons/data/repbase/repbase_green_plants.fa
Configuration - Log file for clustering/merging results:     t_cluster_report.txt
Transposome::Run::Blast::run_allvall_blast started at:   23-02-2016 15:12:06.
Transposome::Run::Blast::run_allvall_blast finished running mgblast on 200000 sequences in 145.28 minutes.
Transposome::Run::Blast::run_allvall_blast completed at: 23-02-2016 17:37:23. Final output file is:
sub.interl.100K_allvall_blast.bln.
Transposome::PairFinder::parse_blast started at:   23-02-2016 17:37:28.
Transposome::PairFinder::parse_blast completed at: 23-02-2016 22:44:23.
Final output files are:
sub.interl.100K_allvall_blast_louvain.int,
sub.interl.100K_allvall_blast_louvain.idx,
sub.interl.100K_allvall_blast_louvain.edges.
Transposome::Cluster::louvain_method started at:         23-02-2016 22:54:03.
grep failed. Caught error: "grep -c level t_results_100K_out/sub.interl.100K_allvall_blast_louvain.tree.log" failed to start: "Cannot allocate memory" at /home/cmb-07/sn1/asifzuba/perl5/lib/perl5/Transposome/Cluster.pm line 139.

however, my samples with 20K reads finished really quickly, and i thought i'd add that log here as well:

(0)> cat transposome_results_out_m20/t_log.txt
======== Transposome version: 0.09.8 (started at: 17-02-2016 12:52:34) ========
Configuration - Log file for monitoring progress and errors: t_log.txt
Configuration - Sequence file:                               sub.interl.20K.fq
Configuration - Sequence format:                             fastq
Configuration - Sequence number for each BLAST process:      2000
Configuration - Number of CPUs per thread:                   4
Configuration - Number of threads:                           2
Configuration - Output directory:                            transposome_results_out_m20
Configuration - In-memory analysis:                          1
Configuration - Percent identity for matches:                90
Configuration - Fraction coverage for pairwise matches:      0.55
Configuration - Merge threshold for clusters:                20
Configuration - Minimum cluster size for annotation:         100
Configuration - BLAST e-value threshold for annotation:      10
Configuration - Repeat database for annotation:              /home/cmb-07/sn1/asifzuba/transposons/data/repbase/repbase_green_plants.fa
Configuration - Log file for clustering/merging results:     t_cluster_report.txt
Transposome::Run::Blast::run_allvall_blast started at:   17-02-2016 12:52:34.
Transposome::Run::Blast::run_allvall_blast finished running mgblast on 40000 sequences in 10.55 minutes.
Transposome::Run::Blast::run_allvall_blast completed at: 17-02-2016 13:03:08. Final output file is:
sub.interl.20K_allvall_blast.bln.
Transposome::PairFinder::parse_blast started at:   17-02-2016 13:03:08.
Transposome::PairFinder::parse_blast completed at: 17-02-2016 13:21:27.
Final output files are:
sub.interl.20K_allvall_blast_louvain.int,
sub.interl.20K_allvall_blast_louvain.idx,
sub.interl.20K_allvall_blast_louvain.edges.
Transposome::Cluster::louvain_method started at:         17-02-2016 13:21:55.
Transposome::Cluster::louvain_method completed at:       17-02-2016 13:46:20.
Transposome::Cluster::make_clusters started at:          17-02-2016 13:46:20.
Transposome::Cluster::make_clusters completed at:        17-02-2016 13:49:33.
Transposome::Cluster::find_pairs started at:             17-02-2016 13:50:54.
Transposome::Cluster::find_pairs completed at:           17-02-2016 13:50:54.
Transposome::Cluster::merge_clusters started at:         17-02-2016 13:50:55.
Transposome::Cluster::merge_clusters completed at:       17-02-2016 13:50:55.
Results - Total number of clustered reads:  19420.
Transposome::Annotation::annotate_clusters started at:   17-02-2016 13:50:58.
Transposome::Annotation::annotate_clusters completed at: 17-02-2016 13:51:26.
Results - Total sequences:                        40000
Results - Total sequences clustered:              19420
Results - Total sequences unclustered:            20580
Results - Repeat fraction from clusters:          0.4855
Results - Singleton repeat fraction:              0.0335762876579203
Results - Total repeat fraction:                  0.502775
Results - Total repeat fraction from annotations: 0.485500000016506
Transposome::Annotation::clusters_annotation_to_summary started at:   17-02-2016 13:51:26.
Transposome::Annotation::clusters_annotation_to_summary completed at: 17-02-2016 13:51:26.
======== Transposome completed at: 17-02-2016 13:51:31. Elapsed time: 58 minutes, 56 seconds. ========

do you think the grep error is a memory issue? also, I have a lot of repeats in my repeat database - do you think this will affect runtime/memory? i also looked for the file sub.interl.100K_allvall_blast_louvain.tree.log in the output for the 100K reads but could not find it. could this be a memory issue as well?

also, a clarification - when i say 20K or 100K reads i mean paired-end reads - so effectively 40K and 200K sequences respectively.

thank you for your help ! really appreciate it.

best, asif

sestaton commented 8 years ago

Yes, it does appear to be a memory issue. If you can tell me the size of the BLAST output file (with du -h *bln) that would tell us the expected memory usage. Still, I am very surprised at this result and I would be interested to see the sequence composition of your data.

That log file is created during clustering, so if clustering failed to initialize, the file would not have been created. Both observations appear to point to memory issues. Sorry for the troubles.

asifzubair commented 8 years ago

Hi Evan, I can't seem to find the blast output file. Here is what the output directory looks like:

asifzuba@hpc-cmb(~/chickpea/transposons/transposome/t_results_100K_out)
(0)> ll
total 34G
drwxrws--x  2 asifzuba lc_sn1 4.0K Feb 25 00:15 .
drwxrwsr-x 10 asifzuba lc_sn1 4.0K Feb 25 00:15 ..
-rw-rw-r--  1 asifzuba lc_sn1  28G Feb 23 22:44 sub.interl.100K_allvall_blast_louvain.edges
-rw-rw-r--  1 asifzuba lc_sn1 5.7M Feb 23 22:01 sub.interl.100K_allvall_blast_louvain.idx
-rw-rw-r--  1 asifzuba lc_sn1 6.5G Feb 23 22:44 sub.interl.100K_allvall_blast_louvain.int
-rw-rw-r--  1 asifzuba lc_sn1 2.1K Feb 23 22:54 t_log.txt

I also checked my successful run and couldn't find a .bln file there either. This is what the output directory looks like:

asifzuba@hpc-cmb(~/chickpea/transposons/transposome/t_results_50K_out)
(0)> ll
total 6.1M
drwxrwsr-x  2 asifzuba lc_sn1 4.0K Feb 25 00:18 .
drwxrwsr-x 10 asifzuba lc_sn1 4.0K Feb 25 00:15 ..
-rw-rw-r--  1 asifzuba lc_sn1 1.1K Feb 24 05:00 sub.interl.50K_allvall_blast_louvain_cls_fasta_files_02_24_2016_04_59_46_annotations.tgz
-rw-rw-r--  1 asifzuba lc_sn1 3.5M Feb 24 05:00 sub.interl.50K_allvall_blast_louvain_cls_fasta_files_02_24_2016_04_59_46.tgz
-rw-rw-r--  1 asifzuba lc_sn1 2.3M Feb 24 04:59 sub.interl.50K_allvall_blast_louvain_merged_02_24_2016_04_59_46.cls
-rw-rw-r--  1 asifzuba lc_sn1 1.4K Feb 24 03:18 sub.interl.50K_allvall_blast_louvain.tree.log
-rw-rw-r--  1 asifzuba lc_sn1  16K Feb 25 00:14 t_cluster_report_annotations_summary.tsv
-rw-rw-r--  1 asifzuba lc_sn1 1.3K Feb 25 00:14 t_cluster_report_annotations.tsv
-rw-rw-r--  1 asifzuba lc_sn1 5.0K Feb 25 00:14 t_cluster_report_singletons_annotations_summary.tsv
-rw-rw-r--  1 asifzuba lc_sn1  80K Feb 25 00:14 t_cluster_report_singletons_annotations.tsv
-rw-rw-r--  1 asifzuba lc_sn1  45K Feb 25 00:14 t_cluster_report.txt
-rw-rw-r--  1 asifzuba lc_sn1 3.3K Feb 25 00:14 t_log.txt

is this expected ?

thanks for the help !

asifzubair commented 8 years ago

Hi Evan,

I'm running the analysis again after removing organelle DNA. The analysis is still going, but I checked the size of the .bln file: it is 115 GB.

asifzuba@hpc-cmb(~/chickpea/transposons/transposome/t_results_100K_decon_out)
(0)> ll
total 115G
drwxrws--x  2 asifzuba lc_sn1 4.0K Feb 29 02:44 .
drwxrwsr-x 13 asifzuba lc_sn1 4.0K Feb 29 02:56 ..
-rw-rw-r--  1 asifzuba lc_sn1 115G Feb 29 02:44 sub.interl.100K.decon_allvall_blast.bln
-rw-rw-r--  1 asifzuba lc_sn1 1.6K Feb 29 02:45 t_log.txt

a.

sestaton commented 8 years ago

That is quite large for that number of reads. You can expect it to use about that much RAM (115 GB). Were you able to determine the level of simple repeats? Filtering simple repeats and rDNA will likely have a large effect on the number of matches.

asifzubair commented 8 years ago

Hi Evan,

I filtered out all organelle DNA (chloroplast, mitochondrial and ribosomal) and masked all simple repeats using RepeatMasker (I used the -noint option, which masks only low-complexity regions and simple repeats). RepeatMasker reported around 2.5% simple repeats plus low-complexity regions.

However, the size of my BLAST file only decreased by 1 GB, so it is still going strong at 114 GB.

Is it possible that you could take a look at the .bln file to see if something is systematically wrong?

Thanks!

sestaton commented 8 years ago

I would be happy to take a look at it for you. You have done all the steps I would try; at this stage it may be easiest for me to see if I can find anything unusual.

I will create an instance for you to upload the FASTA file (that will be easier than moving the BLAST file around), but it is a bit late to be starting this right now so I'll send you the info tomorrow to upload the file.

Thanks, Evan

asifzubair commented 8 years ago

Thank you, Evan ! That would be great !

sestaton commented 8 years ago

Okay, you can upload the file to 159.203.252.57 with the username 'asif'. For example,

sftp asif@159.203.252.57

The password is "G3x4glrLnBw5zibJ8geF" (when prompted) and you can use "put" to upload the file. I'll check it in the morning. Thanks.

asifzubair commented 8 years ago

Thank you, Evan !

I uploaded three files:

sub.interl.100K.fa   # subsampled FASTA file
sub.interl.100K.filtered.fa   # filtered file, where I removed all hits against organelle DNA
sub.interl.100K.filtered.fa.masked   # the filtered file with masked simple repeats

Really appreciate the help.

sestaton commented 8 years ago

Sorry for the delay! I'll get back to you soon about this issue.

asifzubair commented 8 years ago

Sure. Thank you, Evan.

sestaton commented 8 years ago

Hi Asif,

I have looked at the sequence files and it appears to be a problem with simple repeats (mono- and di-nucleotide repeats). Here is an assessment of the unfiltered FASTA file you sent:

Total sequences                        : 200000

===> Calculating simple repeats at 30% : 
Num simple repeats filtered at 30%     : 194842
Repeat fraction at 30%                 : 97.42

===> Calculating simple repeats at 40% : 
Num simple repeats filtered at 40%     : 111120
Repeat fraction at 40%                 : 55.56

===> Calculating simple repeats at 50% : 
Num simple repeats filtered at 50%     : 11305
Repeat fraction at 50%                 : 5.65

===> Calculating simple repeats at 60% : 
Num simple repeats filtered at 60%     : 934
Repeat fraction at 60%                 : 0.47

===> Calculating simple repeats at 70% : 
Num simple repeats filtered at 70%     : 78
Repeat fraction at 70%                 : 0.04

===> Calculating simple repeats at 80% : 
Num simple repeats filtered at 80%     : 23
Repeat fraction at 80%                 : 0.01

===> Calculating simple repeats at 90% : 
Num simple repeats filtered at 90%     : 7
Repeat fraction at 90%                 : 0.00

And for the same sample size from sunflower:

Total sequences                        : 200000

===> Calculating simple repeats at 30% : 
Num simple repeats filtered at 30%     : 185206
Repeat fraction at 30%                 : 92.60

===> Calculating simple repeats at 40% : 
Num simple repeats filtered at 40%     : 56341
Repeat fraction at 40%                 : 28.17

===> Calculating simple repeats at 50% : 
Num simple repeats filtered at 50%     : 5686
Repeat fraction at 50%                 : 2.84

===> Calculating simple repeats at 60% : 
Num simple repeats filtered at 60%     : 386
Repeat fraction at 60%                 : 0.19

===> Calculating simple repeats at 70% : 
Num simple repeats filtered at 70%     : 56
Repeat fraction at 70%                 : 0.03

===> Calculating simple repeats at 80% : 
Num simple repeats filtered at 80%     : 9
Repeat fraction at 80%                 : 0.00

===> Calculating simple repeats at 90% : 
Num simple repeats filtered at 90%     : 1
Repeat fraction at 90%                 : 0.00

If you look at the reads that are above 40% simple repeats, there is about 2X the amount in your sample compared to sunflower, which has a large and complex genome. This is what is causing the issues.

To solve this, I would try to take additional random samples of the data and see if you see the same behaviour. I have noticed that some random samples are problematic like the above data while others are not. Taking multiple samples is recommended for confidence also. This is something you can try that won't take long at all. Please let me know if you see any differences between samples.

The script I used for the above analysis is here: get_simple_repeats.pl.
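
The script itself isn't reproduced in this thread, but the idea can be sketched in awk: measure what fraction of each read is covered by mono-nucleotide runs or tandem di-nucleotide runs, then count reads above a threshold. The run-length cutoffs below (5 bp mono runs, 3 tandem di-nucleotide units) and the toy reads are illustrative guesses, not the actual parameters of get_simple_repeats.pl:

```shell
# Toy reads: r1 is pure (AT)n, r2 has no simple repeats.
printf '@r1\nATATATATATATATATATAT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTTGCAAGCTTGCAACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' > reads_demo.fq
# Fraction of each read covered by mono-nucleotide runs (>= 5 bp) or
# tandem di-nucleotide runs (>= 3 units); count reads at or over 40%.
awk 'BEGIN {
       n = split("AC AG AT CA CG CT GA GC GT TA TC TG", d, " ")
       for (i = 1; i <= n; i++)
         di = di (di == "" ? "" : "|") d[i] d[i] d[i] "(" d[i] ")*"
     }
     NR % 4 == 2 {
       seq = toupper($0); len = length(seq); s = seq
       gsub(/AAAAA+|CCCCC+|GGGGG+|TTTTT+/, "", s)  # strip mono runs
       gsub(di, "", s)                             # strip di runs
       if ((len - length(s)) / len >= 0.40) hit++
       total++
     }
     END { printf "reads >= 40%% simple: %d / %d\n", hit, total }' reads_demo.fq
# prints: reads >= 40% simple: 1 / 2
```

The removed length per read is the repeat coverage, so the script only needs one pass over the FASTQ.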

Cheers, Evan

asifzubair commented 8 years ago

Thank you, Evan.

I took some random samples from my data and checked the simple-repeat fraction using the script you provided. However, focusing on the single metric of the repeat fraction at 40%, the value was always above 55%:

# results for 10 random samples
sub.interl.s11149.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s11149.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 110624
sub.interl.s11149.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.31
sub.interl.s11440.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s11440.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111507
sub.interl.s11440.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.75
sub.interl.s13380.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s13380.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111399
sub.interl.s13380.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.70
sub.interl.s15146.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s15146.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111087
sub.interl.s15146.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.54
sub.interl.s15369.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s15369.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111293
sub.interl.s15369.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.65
sub.interl.s17479.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s17479.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111112
sub.interl.s17479.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.56
sub.interl.s19467.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s19467.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111458
sub.interl.s19467.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.73
sub.interl.s21594.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s21594.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111196
sub.interl.s21594.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.60
sub.interl.s24186.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s24186.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 110936
sub.interl.s24186.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.47
sub.interl.s25958.110621.100K.repeats.log:===> Calculating simple repeats at 40% : 
sub.interl.s25958.110621.100K.repeats.log:Num simple repeats filtered at 40%     : 111243
sub.interl.s25958.110621.100K.repeats.log:Repeat fraction at 40%                 : 55.62

Another thing - is it correct to say that your definition of simple repeats is the fraction of mono- or di-nucleotides in the read? I ask because I had thought simple repeats meant tandem repeats of mono- or di-nucleotides, and those are the sort of repeats I masked using RepeatMasker. Please let me know if my understanding is correct.

Given the fraction of simple repeats, do you think decreasing the BLAST E-value threshold would significantly reduce the size of the BLAST file created? Currently, my E-value is set to 10.

Thank you for your help! Best,

Asif

sestaton commented 8 years ago

Have you taken a look at the raw FASTQ files in FastQC? I ask because you might have adapters present in high enough abundance to cause issues. FastQC will report adapter contamination, so it might be worth a quick look.

Yes, the repeats being calculated in that script are the percentage of mono- and di-nucleotide repeats in each read. If over half of the reads contain >40% simple repeats this will lead to millions of BLAST hits, which is why the BLAST files are so large. This will then cause a large amount of memory consumption in the next step, which is calculating all the pairwise matches and clustering them based on weights.

You might want to try a very low sampling rate, like 20k reads, and increase as possible. I will experiment with the data you sent. Sorry there is not a simpler solution, but this is something that would affect other programs as well. I'll try to work on a solution to this issue.
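
Taking those repeated low-rate samples can be done pair-aware with a short awk one-liner (my sketch; a tool like seqtk sample would also work). Each interleaved pair is an 8-line record, and the keep/skip decision is made once per record so mates stay together; pool_demo.fq below is a stand-in for your interleaved FASTQ:

```shell
# Toy interleaved FASTQ with 50 pairs (8 lines per pair).
: > pool_demo.fq
i=0
while [ $i -lt 50 ]; do
  printf '@r%d/1\nACGT\n+\nIIII\n@r%d/2\nACGT\n+\nIIII\n' "$i" "$i" >> pool_demo.fq
  i=$((i + 1))
done
# Keep each pair with probability p; srand(42) makes a run repeatable
# on a given awk implementation.
awk -v p=0.5 'BEGIN { srand(42) }
     NR % 8 == 1 { keep = (rand() < p) }
     keep' pool_demo.fq > sample_demo.fq
wc -l < sample_demo.fq   # a multiple of 8, at most 400
```

Varying the seed gives the independent replicate samples; comparing their repeat fractions shows whether a bad sample was a fluke.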

One more thing: I would recommend doing a k-mer analysis to look at the uniqueness of your reads. A comparison to other species would be informative here to see why there is low complexity in this data set. That is something I can do with genomes I have on hand, and I'll try to upload an image this week if I get some time.
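
That k-mer comparison would normally use a dedicated counter such as Jellyfish or KMC, but a crude distinct-vs-total ratio can be computed with awk alone (a rough sketch; lower ratios suggest lower sequence complexity). The toy single-read file stands in for a real FASTQ:

```shell
# Toy read: 20 A's gives 6 overlapping 15-mers, all identical.
printf '@r1\nAAAAAAAAAAAAAAAAAAAA\n+\nIIIIIIIIIIIIIIIIIIII\n' > kmer_demo.fq
# Distinct 15-mers over total 15-mers across all reads.
awk -v k=15 'NR % 4 == 2 {
       for (i = 1; i <= length($0) - k + 1; i++) {
         s = substr($0, i, k)
         if (!(s in seen)) { seen[s] = 1; distinct++ }
         total++
       }
     }
     END { printf "distinct/total %d-mers: %d/%d = %.3f\n", k, distinct, total, distinct / total }' kmer_demo.fq
# prints: distinct/total 15-mers: 1/6 = 0.167
```

Running the same ratio on samples from two species gives a quick side-by-side complexity comparison, though the in-memory hash limits this to modest sample sizes.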

asifzubair commented 8 years ago

Thank you, Evan.

I ran FastQC on the reads that I sent you and it did not show any adapter contamination. However, some other samples that I was running did have adapter contamination - thank you for reminding me to check this.

As a consequence, I was able to run transposome on 25K reads for the sample that I had sent you. However, my samples with adapter contamination failed for even 25K reads. I'll try to remove adapters and rerun those samples.

I think we did some k-mer analysis and found many repeats. However, a comparative analysis with other genomes would be really helpful. Thank you.

sestaton commented 8 years ago

I'm glad you got it working at that level at least. How were the results in terms of repeat abundance? I'm surprised by the challenges with this sample, which makes me think it might be some kind of artifact. I'll try to do that k-mer analysis because it should be informative for this issue.