Open Nusob888 opened 3 years ago
Could you please show me the full command you used? Thanks.
Hi,
sure thing, this was for chemistry v3, but I have tried it for V1 and V2, same thing happens. Each time I adjusted the ranges to fit the chemistries
` run-trust4 -t 24 -f ~/hg38_bcrtcr.fa --ref ~/human_IMGT+C.fa -u ~/XXXXX_2 --barcode ~/XXXXX_1, --barcodeRange 0 15 + --barcodeWhitelist ~/3M-february-2018.txt --UMI ~/XXXXX_1 --umiRange 16 27 + --od ~/outdir/
`
Could you please show me the full on-screen output from TRUST4? I need to make sure that the crash happened in the fastq-extractor program. Thanks.
Sorry, our cluster locks out copy and pasting the output and I have to keep the study IDs of the samples confidential. But yes it occurs just as the fastq-extractor starts. It then prints out "system" followed by the input commands as above. Then followed by failed: 11 at run-trust4 line 48.
Is that enough? if not I can email you a screen capture
Just want to make sure, the 3M-february-2018.txt file is compressed in the cellranger package, so have you decompressed it before running TRUST4?
Did you see the line " Start to extract candidate reads from read files." output on the screen?
Hi, yes I took it from a local installation of cellranger and uploaded it on our cluster within a custom directory that I then supply the absolute path to.
The reason was because our cell ranger is a module installation that cannot be edited by end-users.
But the file is the decompressed txt.
What I see is:
[Sat Aug 28 19:42:51 2021] Start to extract candidate reads from read files. system ~/TRUST4/fastq-extractor -t 24 -f ~/hg38_bcrtcr.fa -o ~/out_dir//TRUST_XXXXX_2_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/3M-february-2018.txt --umiStart 16 --umiEnd 27 -u ~/XXXX_2.fastq.gz --barcode ~/XXXX_1.fastq.gz --UMI ~/XXXX_1.fastq.gz failed: 11 at ~/TRUST/run-trust4 line 48
This is very strange. Could you please show me the first few lines for XXX_1.fastq.gz, XXX_2.fastq.gz and 3M-february-2018.txt files?
it is just the absolute paths to the locations:
for our system its
/groupdir/users/username/projects/projectx/gex/XXX_1.fastq.gz /groupdir/users/username/projects/projectx/gex/XXX_2.fastq.gz /groupdir/users/username/reference/cellranger_whitelists/v3/3M-february-2018.txt
I would be surprised if that is an issue as trust4 runs to completion when I do not include the whitelist. It will extract the barcodes and umis and reads fine and generate the outputs.
It seems to be something to do with the whitelist matching and correction. The only thing might be if you call directly on any shared libraries or /bin located files, we do not have access to those. Or if under the hood you are calling on cell ranger? but I didn't see that in the script.
Sorry, I mean the first few lines of the content in those files, so I can check whether the lengths, especially in the whitelist file, match the files on our server. Thanks.
Oh right. sorry.
R1
@XXXXX.1 1 length=28 CNAAGCGAGCTTACGACGCCTTTCATGC +XXXXX.1 1 length=28 F#F,FF,FFFFFFF:,FFFFFFFFFFFF @XXXXX.2 2 length=28 AATGCCAAGAACGCGTAGACATCTAGCT +XXXXX.2 2 length=28 FFFFFFFFFFFFFFFFFFFFFFFFFFFF @XXXXX.3 3 length=28 CCTCACACACAACCGCTTCTGGTGCAGA
R2
@XXXXX.1 1 length=151 GAAGAATGGTACAAATCCAAGTTTGCTGACCTCTCTGAGGCTGCCAACCGGAACAATGACGCCCTGCGCCAGGCAAAGCAGGAGTCCACTGAGTACCGGAGACAGGTGCAGTACCTCACCTGTGAAGTGGATGCCCTTAAAGGAACCAAAA +XXXXX.1 1 length=151 FFFFFFF:FFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFF:F:FFFF:FFFFFFFFF::FFFFFFFFFFF,FFFFFFFFFFFFFFFFFF,FFFFF:FFFFFFFFFFF,FFFFFF:FFFFFFFFFF,FFF:FFFFFFFF:,,FF:F: @XXXXX.2 2 length=151 AGCCACTGCGCCCAGCCGGTCTTACCCACTTTTATATATATTTGAAACACAGTGGAGTAGATACCTAATAAATATTTGTTAGATTTATCTGAAGAAACTTAGATTAAATTCACTCGGAAGGTGCCTTATGTCAGGGATACTAAAAAATTAT +XXXXX.2 2 length=151 FFFFFFFFF,FFFFFFF,FFFFF:FFFFF::FFFFFFF:F:FFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FF,FF,FFFFFFF:FFFFFFFFFFF,FFFFFFF:FF,FF,,FFFFF,F,:,F::,F:FFFF @XXXXX.3 3 length=151 GCCTGGTGGCTCACACCTGTAATCCTAGCACTTTAGGAGGCCAAGGGGGGTGGATCACGAGGTCAGGGGTTCGAGACCAGCCTGACCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGCGTGGTGGCAGGCGCCT
whitelist:
AAACCCAAGAAACACT AAACCCAAGAAACCAT AAACCCAAGAAACCCA AAACCCAAGAAACCCG AAACCCAAGAAACCTG AAACCCAAGAAACGAA AAACCCAAGAAACGTC AAACCCAAGAAACTAC AAACCCAAGAAACTCA AAACCCAAGAAACTGC AAACCCAAGAAACTGT AAACCCAAGAAAGAAC AAACCCAAGAAAGACA AAACCCAAGAAAGCCT AAACCCAAGAAAGCGA AAACCCAAGAAAGGAT AAACCCAAGAAAGGTA AAACCCAAGAAAGTCT AAACCCAAGAAATAGG AAACCCAAGAAATCCA AAACCCAAGAAATGAG
Sorry, can I just add a request onto the same issue chain.
Is it possible to index the annot.fa with the report.tsv or cdr3.out?
or Add read counts into the annot.fa names?
I am trying to re-annotate the annot.fa assembly contigs with Igblast but there is no way to index back to the read counts. Since annot.fa is a consensus contig, the read fragment count is lost.
This would be really helpful as a feature since I predominantly use Immcantation as my analysis workflow to get mutation counts etc. But in its current state, I cannot do it in the same way as I can with MIXCR. In MIXCR indexing is possible as they provide the "targetsequence" (consensus assembly) in their report file.
Thanks very much for the consideration.
The first column in the cdr3.out file is the contig id in the annot.fa file, which can be used for index. Or do you mean other way of indexing? The report.tsv file is the summary of cdr3.out file, and the second last column is the representative's contig id in the annot.fa file.
For the whitelist issue, it finished without any error on our data even when testing with different settings. I may need some more time to test. In our previous experiment, the barcode correction marginally improved the results.
Thanks for the update on the whitelist issue and ongoing troubleshooting.
For the indexing, what I have noticed, is that there are multiple duplicated assemble ids. e.g. assemble0 occurs up to 20 times for me.
When I look closer, this is the case in the report also, and it seems this is down to the fact that trust4 allows for SMH, and therefore multiple cids can be assigned to different sequence contigs.
The cdr3. out file nicely contains a consensus_id index, but this is not present in the annot.fa and therefore we cannot index the exact sequence in the annot.fa file to the report or cdr3.out file. Also the cdr3.out file oddly has read counts that look like averages? 76.08 for example, whereas the report has round numericals
The sequence contig in the annot.fa file is the consensus and encodes/compresses highly similar sequences, such as from SHMs as you mentioned. Therefore, in the _cdr3.out file, TRUST4 also outputs those minor CDR3s encoded in the consensus. In other words, the CDR3s are all from the sequence in annot.fa, and they are listed by the second column "index_within_consensus" in cdr3.out. I think for the CDR3s from the same contig/consensus, you can pick the one with highest abundance as representative.
As for the decimal numbers of the abundance, it is because that when TRUST4 tries to decode the CDR3 encoded in the consensus, a read partially overlapped with CDR3 region could support more than 1 CDR3 type. Therefore, TRUST4 applied EM algorithm to quantify the CDR3s from the same consensus.
Hope these explanations help.
Thanks, that makes sense.
However, I remain a little confused how trust4 works in this case:
Lastly, I would prefer to index all annot.fa contigs with those in the cdr3.out or report.tsv files. The trouble with picking a highly abundant one is that the nucleotide sequences are not identical, therefore for diversity analysis, you are losing information if you just pick by consensus or abundance for use in other tools.
This seems like it could be easily resolved by adding a index_within_consensus ID into the annot.fa file?
Yes, you can regard the consensus assembly being the most abundant clonotype. The receptor sequence is much longer than a read, for example, it is difficult to decide whether two variations in V and J genes respectively are from the same receptor sequence. So TRUST4 only tries to recover the encoded sequence in CDR3, which in many cases can be fully covered by one read.
I think I can write a simple script to break up consensuses in annot.fa file with information from _cdr3.out file, so they have a one-to-one mapping.
That would be great. Thank you.
Alternatively, if the fully assembled contig could be added to the report.tsv and cdr3.out file, that could be good as a user can then generate a fasta file directly from the contig.
re: how trust4 recovers sequences: Does this mean that trust4 does not actually assemble a full contig per B cell receptor variation?
In a B cell receptor I would consider any variations of V, junction, J to be potentially relevant and warrants assembly into a full contig so long as the read fragments were of good quality and the overlaps with high confidence.
This is how Bracer and MIXCR handle their contigs.
I understand the logic from the methods section of trust4, that from a computational efficiency perspective the consensus system is faster. However, from an analysis side, grouping to a single assembly loses a lot of information and granularity on how diverse an individual clonal family is. Many BCR analysis specifically look at the phylogeny of individual clones of interest to understand clonal selection/affinity maturation.
Is there a possibility to generate variations of a consensus sequence by knitting reads of variance into their own individual contigs. e.g. if a read has 2 mutations in CDR1 region, but otherwise aligns exactly to the consensus, then annot.fa will add an additional contig that is the consensus but deviates in those 2 nucleotides in CDR1?
If there are enough variations, TRUST4 will break them up into different contigs. The full contigs are for the highly expressed receptors. I think Bracer is for single-cell assemblies. In the single-cell setting, TRUST4 will not mix the variations from different cells into the same consensus. As for the last part, if a read has 2 mutations in CDR1, but there could be mutations in CDR2 or CDR3 from other reads, and it is very difficult to knit them together in bulk RNA-seq setting.
Thanks for explaining, that makes sense now.
In that case the only desired addition, if possible, would be to break up annot.fa to match cdr3.out and report.tsv or to have consensus full length sequences as an additional column in the report.tsv. That would help a lot with indexing back to the read counts for all within consensus ids
I just add a perl script AddSequenceToCDR3File.pl to the scripts folder in the github repo. You can run it as: perl AddSequenceToCDR3File trust_cdr3.out trust_annot.fa > combined_cdr3.out It will add the assembled contig sequence to each row in the cdr3 file and also change the CDR3 sequence in the contig accordingly.
Please let me know whether it works.
Hello, Sorry I left a comment here and hope didn't interrupt the discussion. I met the same error with whitelist. From my questions before I knew I need barcode whitelist correction. Now when I try to process it, I met same issue. Thus, I left a comment here. Next is my code and output message in terminal:
[000003 ~]$ ~/software/TRUST4/run-trust4 -t 8 -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz --barcodeRange 0 15 + --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -f ~/software/TRUST4/hg38_bcrtcr.fa --ref ~/software/TRUST4/human_IMGT+C.fa -o ~/project/sc/output/trust4/gex_list/ [Tue Aug 31 10:36:49 2021] TRUST4 begins. [Tue Aug 31 10:36:49 2021] SYSTEM CALL: ~/software/TRUST4/fastq-extractor -t 8 -f ~/TRUST4/hg38_bcrtcr.fa -o ~/project/sc/output/trust4/gex_list/_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz [Tue Aug 31 10:36:50 2021] Start to extract candidate reads from read files. system ~/software/TRUST4/fastq-extractor -t 8 -f ~/software/TRUST4/hg38_bcrtcr.fa -o ~/project/sc/output/trust4/gex_list/_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz failed: 139 at ~/software/TRUST4/run-trust4 line 48.
Here I changed all the home directory to \'~' to delete my names in it. Also sorry for the long folder names of data.
I noticed one thing that in README the example code used --barcodeWhiteList and the parameter is --barcodeWhitelist with lower case in L of List. Will this make any differences?
Please tell me more about these. Thank you.
That is a typo in README. It should be --barcodeWhitelist. I will fix that right away. Thank you!
Hi @mourisl , this works a treat! thank you so much for that!
I will keep the thread open for @cwxiix and the whitelist issue, but this solves the indexing
I tried several different data sets on our server and could not reproduce this whitelist error. What system were you using? Was it MacOS?
For me, I am using a Xshell terminal to connect to a server and the system is CentOS Linux release 7.5.1804 (Core) . Is this information helpful?
Our cluster structure is a little complex, but each node runs a version of linux. Just to sanity check, are there any hardcoded parts in the script that require the whitelist to be in a cellranger library? because I stored the whitelists in a project specific directory. I had a look and it didn't appear to be the case, but thought I would check.
Hi,
I have a same problem with 'Nusob888'. The running code is here.
run-trust4 -f /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/hg38_bcrtcr.fa \
--ref /mnt/S3/data2/workbench/Users/ktkim/ref/tenX/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
-u /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R2_*.fastq.gz \
--barcode /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R1_*.fastq.gz \
--barcodeRange 0 15 \
--barcodeWhitelist /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/translation/3M-february-2018.txt
and error is like this.
failed: 11 at /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/run-trust4 line 48.
Those fasta files came from 10X, and some infomation is here.
R1 :
@A00930:41:HWGVCDSXX:3:1101:5104:1000 1:N:0:AAGCAGTC
NCAATCTTCACGAAGGCCCAGCCGCA
+
#FFFFFFFFFFFFFFFFFFFFFFFFF
R2 :
@A00930:41:HWGVCDSXX:3:1101:5104:1000 2:N:0:AAGCAGTC
GACATCCGAGGATGGATTGAAGGTAGGATAGGGGCTCACCGCTGATCCGGGACCACCTTTGGATGACTTCACAGTTTGAACATATTCCTGCTCTTCAT
+
FFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFF
whitelist came from cellranger :
AAACCCAAGAAACACT AAACCCATCAAACACT
AAACCCAAGAAACCAT AAACCCATCAAACCAT
AAACCCAAGAAACCCA AAACCCATCAAACCCA
AAACCCAAGAAACCCG AAACCCATCAAACCCG
AAACCCAAGAAACCTG AAACCCATCAAACCTG
AAACCCAAGAAACGAA AAACCCATCAAACGAA
AAACCCAAGAAACGTC AAACCCATCAAACGTC
AAACCCAAGAAACTAC AAACCCATCAAACTAC
AAACCCAAGAAACTCA AAACCCATCAAACTCA
AAACCCAAGAAACTGC AAACCCATCAAACTGC
Thanks to read.
@ttab963 I think the reason for your case might be different. There are two issues with your command:
Thanks for your reply. I changed the code, but same error occurs.
nohup run-trust4 -f /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/hg38_bcrtcr.fa \
--ref /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/human_IMGT+C.fa \
-u /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R2_*.fastq.gz \
--barcode /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R1_*.fastq.gz \
--barcodeRange 0 15 + \
--barcodeWhitelist /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/translation/3M-february-2018.txt &
@ttab963 Your whitelist is from the cellranger translation folder. It contains two columns maybe for the translation of multiome data. TRUST4 assumes each row contains one barcode, so you should use the barcode in the path /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/ instead of that in the translation folder. (You still need to unzip the whitelist there).
Sorry for bothering you.
I realize my data is actually build with 10x 5' library, so I change whitelist to ~/cellranger-4.0.0/lib/python/cellranger/barcodes/737K-august-2016.txt
.
so I fix again and check that whitelist file has just one column.
But same error again.
Cellranger process with those data has any problems so data is not corrupted. And plus, I also run well with example file or without whitelist option.
Thanks again, @mourisl.
Hello, I just wanted to mention that I am getting this same error when running TRUST4 with the whitelist (failed: 139 at ./TRUST4/run-trust4 line 48.
). It only happens when I run on a remote server running Ubuntu 20.04 -- I was able to run the same analysis with the whitelist successfully on MacOS. Also, when I remove the -barcodeWhitelist
option when running on the server, it runs successfully.
Thanks!
I have the exact same error at the fastq-extractor when running with my own whitelist, (when removed it works fine):
failed: 11 at /home/cthouv/miniconda3/envs/ngs_bcr/bin/run-trust4 line 48.
Here's my input:
run-trust4 -f /home/cthouv/references/trust4/bcrtcr.fa --ref /home/cthouv/references/trust4/IMGT+C.fa -1 /home/cthouv/bcrseq_t1/seqs/frag2min_S1_L001_R1_001.fastq -2 /home/cthouv/bcrseq_t1/seqs/frag2min_S1_L001_R2_001.fastq --read1Range 20 -1 --barcode /home/cthouv/bcrseq_t1/seqs/frag2min_S1_L001_R1_001.fastq --barcodeRange 0 7 + --barcodeWhitelist /home/cthouv/references/trust4/barcode_whitelist_24.txt --repseq
my barcode_whitelist_24.txt:
GAAGTGCT
CCTCGTTA
CGAACAAC
TCGTCTGA
ATCTGACC
TTGGACTG
CGCCTTAT
GACTACGA
TCGAACCT
TGAGCTGT
CCTAGAGA
GAGTAGAG
ATGAGTGC
ATGTGGAC
AACGTAGC
GGATTCAC
TCAATCCG
ATTCCGCT
CCTACCTA
CGTCTTCA
GTACACCT
TGTCAGTG
AACGCCTT
GACTTGTG
I'm using it on a remote server running CentOS Linux, version 7. Thanks!
@rawlings-lab Which TRUST4 version are you using? This issue should be fixed with a recent update.
My fault on that. I installed through conda and it looks like it installed 1.05 and I hadn't noticed. Got it installed through github and the error is gone. Much appreciated for the quick response.
Hi @Nusob888 @mourisl @jmostrom013 @rawlings-lab , I also run into the same error yesterday:
[...] Start to extract candidate reads from read files.
system /domus/h1/paucza/.conda/envs/stag/bin/fastq-extractor -t 6 -f ../data/TRUST4/PRJNA549239/SRR9310044/TRUST_SRR9310044_3_toassemble --barcodeStart 0 --barcodeEnd 15 -u ../data/FASTQ/PRJNA549239/SRR9310044/SRR9310044_3.fastq.gz --barcode ../data/FASTQ/PRJNA549239/SRR9310044/SRR9310044_2.fastq.gz failed: 139 at /home/paucza/.conda/envs/stag/bin/run-trust4 line 48.
It seems that the issue has nothing to do with the white list, as I was getting the error even on bulk-RNAseq datasets. The issue was that TRUST4 was not creating the output directory specified in --od
(as normal for other tools). For me, creating the folder beforehand solved it:
mkdir -p $OUT/TRUST4/$i/$j
# FOR EACH SINGLE_CELL DATASET
for i in $DATASETS_SC; do echo $i
RUNS=$(basename -a $OUT/SRA/$i/*)
for j in $RUNS; do echo $j
mkdir -p $OUT/TRUST4/$i/$j # <--------------------------- ADDED THIS LINE
run-trust4 \
-f TRUST4/mouse/GRCm38_bcrtcr.fa \
--ref TRUST4/mouse/mouse_IMGT+C.fa\
-u $OUT/FASTQ/$i/$j/${j}_3.fastq.gz\
--barcode $OUT/FASTQ/$i/$j/${j}_2.fastq.gz\
--od $OUT/TRUST4/$i/$j \
--barcodeRange 0 15 + \
--barcodeWhitelist TRUST4/737K-august-2016.txt\
-t 1 &
done
done
Hi,
Sorry for spamming the issues. I am having an error come up whenever I try to add a whitelist.
essentially the run report gives me no additional explanation other than: failed: 11 at run-trust4 line 48
I have tried to look over the script but can't really identify why it should fail there.
If I remove the barcodeWhitelist parameter, it carries on as normal. I have also triple checked the whitelists and can match them to the dataset fine.
Is it possible to check if this is a reproducible error? or something specific to my run?
It occurs on all datasets I have tried it on.
Thanks