Closed: callumparr closed this issue 1 year ago.
Ah OK, so it was doing something, but then when it started to update the database it hit many errors.
[ 2023-06-11 16:51:28 ] All jobs complete. Starting database update.
[ 2023-06-11 17:20:18 ] Validating database........
Database counter for 'genes' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 153594
counter_value: 172923
Database counter for 'transcripts' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 1644012
counter_value: 2021938
Database counter for 'location' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2116074
counter_value: 2397222
Database counter for 'edge' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 3020322
counter_value: 3559795
Database counter for 'observed' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 164632280
counter_value: 173941741
Traceback (most recent call last):
File "/home/callum/miniconda3/bin/talon", line 33, in <module>
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2464, in main
end_support = parse_custom_SAM_tags(sam_record)
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 1781, in update_database
# get overlap and compare
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2095, in check_database_integrity
except Exception as e:
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
/analysisdata/fantom6/Interactome/ONT-CAGE_TALON_dorado/scripts/talon.sh: line 22: rep1: command not found
/tmp/F6_interactome_neurogenesis_QC.log: 74.5% -- replaced with /tmp/F6_interactome_neurogenesis_QC.log.gz
gzip: /tmp/*talon_read_annot.tsv: No such file or directory
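As an aside, the `rep1: command not found` from line 22 of talon.sh usually means the shell treated a continuation line as a separate command, typically because a trailing backslash was dropped or followed by a stray space. The actual contents of talon.sh are unknown, so this is only an illustration of the failure mode:

```shell
# Minimal demonstration: with the trailing backslash, the second line is
# an argument to echo; delete the backslash and the shell would instead
# try to execute "part2" as its own command ("part2: command not found").
echo part1 \
     part2        # prints: part1 part2
```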
Hey, my suggestion when dealing with this much data is to run TALON sequentially. I have had luck running it on hundreds of millions of reads when I feed it ~40 million reads at a time.
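That sequential strategy can be sketched as a loop over pre-split BAM chunks feeding the same database one run at a time. The chunk file names and the config-file layout below are assumptions for illustration; verify the flags and the config columns against `talon --help` before using this:

```shell
# Hedged sketch: run TALON sequentially on pre-split BAM chunks
# (~40M reads each) against one shared database.
for chunk in chunks/rep1_part*.bam; do
    name=$(basename "$chunk" .bam)
    # Assumed TALON config layout: dataset name, description, platform, file
    printf '%s,rep1,ONT,%s\n' "$name" "$chunk" > "$name.csv"
    talon --f "$name.csv" --db talon.db --build hg38 --o "$name"
done
```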
Hi @fairliereese thanks for the reply!
I am trying to get all samples in at once (for context). So instead I now load the data in per chromosome to reduce the size of what TALON has to handle, basically running TALON 25 times to cover the major chromosome contigs. I hope this doesn't break any of TALON's internal logic.
@fairliereese
To speed up the database generation I took two tacks, both of which involved splitting all sample alignments by chromosome and then running TALON either (a) sequentially into the same database, one chromosome contig at a time, or (b) in parallel, creating a database for each chromosome and adding a prefix to the TALON IDs. The latter is obviously faster for generating all the annotations, but it means a lot of downstream work handling the different talon.db files. Given that each database was initialized with the same hg38 build and GENCODE v39 annotation, is it possible to merge them into one database? The only overlap would be the initial GENCODE annotations from initializing a database for each chromosome.
Actually splitting by chromosome will not really help with speeding up because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments which often just end up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.
Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution, transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't rely on exact matching across transcripts the way we can for things like splice sites. If you're interested in using Cerberus I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.
Yes, I'm realizing that separating by chromosome and running sequentially doesn't make sense, since that is exactly where the parallelization comes from.
Running and outputting a .db for each chromosome is very quick, and at the moment we are thinking of creating filter whitelists and GTFs from them and then merging the per-chromosome annotation files into one. Since each annotation we merge comes from a separate chromosome, this shouldn't cause any headaches. Or am I missing something?
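For reference, merging the per-chromosome GTFs can be as simple as keeping the `#` header lines from one file and concatenating the feature lines from all of them. The file names below are assumptions, not TALON's actual output names:

```shell
# Keep the '#' header lines from the first per-chromosome GTF, then
# append the feature (non-header) lines from every per-chromosome GTF.
grep -h '^#' chr1_talon.gtf > merged_talon.gtf
grep -hv '^#' chr*_talon.gtf >> merged_talon.gtf
```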
I can't think of any downsides to this off the top of my head.
Actually, now that I'm thinking about it the only thing you'll need to be careful for is not merging abundance of transcripts from the separate chromosomes together even if they have the same transcript ID.
I extracted an abundance file for each chr.db. Can I not simply rbind the resulting .tsv files so that it becomes a chromosome-sorted abundance file? Counts for each isoform should appear only once, as each isoform is located on one chromosome only.
Sorry, I probably misunderstood your point.
Every time I build a new database from the same GENCODE annotation, TALON will assign the same index to these known annotations, right?
Yes, but you will run the risk of having duplicated transcript IDs. For instance, novel transcript number 1 from chromosome 1 will not be the same as novel transcript number 1 from chromosome 2. This is perhaps an obvious point and there would be easy ways to make your novel transcript IDs unique but I wanted to make sure to point it out nonetheless.
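One quick way to check for exactly this problem is to look for any transcript ID that appears in more than one per-chromosome abundance file. The file names and the assumption that the transcript ID sits in column 2 are illustrative; confirm the column position against your own TSVs:

```shell
# Print any transcript ID (assumed column 2) occurring more than once
# across the per-chromosome abundance TSVs; empty output means the IDs
# are safe to stack.
tail -q -n +2 chr*_talon_abundance.tsv | cut -f 2 | sort | uniq -d
```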
Ah, I see. Yes, I added a prefix for novel annotations when initializing each database.
Admittedly I am running this on a very large dataset: all in, the merged BAM contains something like 180M primary alignments. The log output seems to have stopped at this point, and I cannot see any additions to the temp files in talon_tmp, nor any writing to the TALON.db file, in over 24 hr. Has some timeout occurred?
I am running on a node that has 1 TB of memory, and that seems to be fine. I checked the lines in the QC log file and it seems TALON still hasn't gone through all the alignments from the merged BAM file.