nadegeguiglielmoni / GraphUnzip

Unzip assembly graphs with Hi-C data and/or long reads.
GNU General Public License v3.0

graphunzip.py linked reads prints errors #20

Closed ykkim0127 closed 1 month ago

ykkim0127 commented 2 years ago

Hi! I got these errors while running graphunzip.py with linked reads.

,ignoring the line, are you sure the BX:Z: tags are there ?

After that, the output file is filled only with lines like these:

<80>^D<95>¶^@^@^@^@^@^@^@<8c>^Gcopyreg<94><8c>^N_reconstructor<94><93><94><8c>^Qscipy.sparse._dok<94><8c>
dok_matrix<94><93><94><8c>^Hbuiltins<94><8c>^Ddict<94><93><94>}<94><87><94>R<94>}<94>(<8c>^F_shape<94>Mq        Mq      <86><94><8c>^Hmaxprint<94>K2<8c>^Edtype<94><8c>^Enumpy<94>h^P<93><94><8c>^Bf8<94><89><88><87><94>R<94>(K^C<8c>^A<<94>NNNJÿÿÿÿJÿÿÿÿK^@t<94>bub

When I check the SAM file (produced with the -C option) for barcode information using grep, there is no BX:Z: tag. I used 10x Genomics Chromium FASTQ files downloaded from the website. Is this error related to the missing barcode information?

RolandFaure commented 2 years ago

Hi, sorry for the late reply. If there are no BX:Z: tags in the SAM file, GraphUnzip cannot retrieve the barcode information, and that is probably why the program fails. The question now is: where is the barcode information in your SAM file? If it is stored under a tag other than BX:Z:, tell me and I'll add the possibility for GraphUnzip to read another tag. If the barcodes are not in the SAM file at all, you need to check how the barcode information is stored in the original FASTQ file (e.g. were the barcodes detached from the reads? Were they tagged as BX:Z:?). One possible explanation is that the barcodes have not yet been detached from the reads and that you need to run Longranger basic (https://support.10xgenomics.com/genome-exome/software/pipelines/latest/what-is-long-ranger) to detach them.
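As a quick way to check the point above, here is a minimal Python sketch that counts how many SAM alignment records carry a BX:Z: tag. The function name `count_barcoded_records` and the example records are made up for illustration; this is not part of GraphUnzip, and it only assumes the standard tab-separated SAM layout, where optional tags start at the 12th column.

```python
def count_barcoded_records(sam_lines, tag="BX:Z:"):
    """Return (records_with_tag, total_records) for an iterable of SAM lines."""
    tagged = 0
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines (header lines start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        fields = line.rstrip("\n").split("\t")
        # The 11 mandatory SAM columns come first; optional tags follow.
        if any(field.startswith(tag) for field in fields[11:]):
            tagged += 1
    return tagged, total


# Hypothetical records, for illustration only.
example = [
    "@SQ\tSN:contig_1\tLN:1000",
    "read1\t0\tcontig_1\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:AAACCCGGGTTTAAAC-1",
    "read2\t0\tcontig_1\t5\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:0",
]
print(count_barcoded_records(example))  # (1, 2)
```

On a real file, the same function can be fed an open file handle, e.g. `count_barcoded_records(open("aln.sam"))`, where `aln.sam` is a placeholder name. A result of zero tagged records would match the situation described in this issue.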

ykkim0127 commented 2 years ago

Hi ! I sent you some files via email.

Yes, there are no barcode tags in the SAM file, nor in the FASTQ file. The header lines contain little library information apart from the N:0: field. Please see the attached file (2_edit.fastq). Does this mean the barcodes have already been detached, or are they still included in the sequences with only the tag missing?
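The same check can be done one step earlier, on the FASTQ file itself. The sketch below counts FASTQ records whose header line contains a BX:Z: comment. It assumes plain four-line FASTQ records (no wrapped sequences) and assumes that the aligner's -C option copies the FASTQ header comment into the SAM record, as `bwa mem -C` does; the thread does not say which aligner was used. The function name and the example reads are hypothetical.

```python
def fastq_barcode_summary(fastq_lines, tag="BX:Z:"):
    """Return (headers_with_tag, total_records) for 4-line FASTQ records."""
    records = 0
    tagged = 0
    for index, line in enumerate(fastq_lines):
        # Every fourth line, starting with the first, is a header line.
        if index % 4 == 0:
            records += 1
            if tag in line:
                tagged += 1
    return tagged, records


# Hypothetical reads: the first has an Illumina-style comment only,
# the second carries a BX:Z: barcode comment.
example_fastq = [
    "@read1 1:N:0:ACGTACGT", "ACGTACGT", "+", "FFFFFFFF",
    "@read2 BX:Z:AAACCCGGGTTTAAAC-1", "ACGTACGT", "+", "FFFFFFFF",
]
print(fastq_barcode_summary(example_fastq))  # (1, 2)
```

If no header carries the tag, the SAM file produced from these reads would not carry it either under the assumption stated above.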

I also found that hic_interactionmatrix.txt is filled with the same kind of lines as linkedreads_interactionmatric.txt (shown above). Please see the two attached files (hic/linkedsreads_interactionsmatrix.txt). I thought this output was wrong, but when I run graphunzip.py unzip with these two matrix files, it finishes properly and produces a final assembly.gfa and assembly.fasta without any error message. Below is the command line I used, followed by the messages printed when the command ran.

./graphunzip.py unzip -g assembly_graph.gfa -i hic_interactionmatrix.txt -k linkedreads_interactionmatric.txt -l m64062_m64032_3.gaf -o 220928.gfa -f 220928.fasta
WARNING:  221  contigs out of  2417  had no coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
Loading the Hi-C interaction matrix
Loading the linked-reads interaction matrix
================

Everything loaded, moving on to untangling the graph

================

*Untangling the graph using long reads*

Reading the gaf file...
Finished going through the gaf file.
Building consensus bridges from all the long reads
Done building consensus bridges                 
Now we will determine through an iterative process what contigs of the assembly are present only once in the final genome
Out of  1625  supposed single-copy contigs,  26  were not actually haploid. Recomputing until all the single-copy contigs are robust
Let's move on to actually untangling the graph
Now we correct the last quirks by looking a posteriori at the graph               
Merging contigs that can be merged...

*Done untangling the graph using long reads*

*Untangling the graph using Hi-C*

Normalizing the interaction matrix
Finished normalizing the interaction matrix
Determining the list of all knots of the graph that I will try to solve
Finished determining the list of knots, there are  196  of them. Now determining pairs of single-copy contigs that should be linked through other contigs.
Finished matching haploid contigs, now we'll move on to  determining the paths linking them
Finished determining the paths, now modifying the graph and duplicating necessary contigs
Finished round of untangling number  1 . Untangled  30261  contigs. Going on one supplementary round if  30261 > 0 and if  1 < 2
Determining the list of all knots of the graph that I will try to solve
Finished determining the list of knots, there are  77  of them. Now determining pairs of single-copy contigs that should be linked through other contigs.
Finished matching haploid contigs, now we'll move on to  determining the paths linking them
Finished determining the paths, now modifying the graph and duplicating necessary contigs
Finished round of untangling number  2 . Untangled  10834  contigs. Going on one supplementary round if  10834 > 0 and if  2 < 2
Merging contigs that can be merged...

*Done untangling the graph using Hi-C*

Now exporting the result

The problem is that the final assembly.fasta has more contigs than the draft assembly. Please see the two attached files (draft_report.txt, unzipped_report.txt). Specifically, the number of contigs increased from 1,102 to 1,544, and the number of contigs >= 50,000 bp decreased from 294 to 108. However, the N50 increased from 32,645,358 to 65,251,927. Could you please explain why the contigs are fragmented even after combining with 10x and Hi-C data? Is it related to the barcode information?
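For readers puzzled by the same pattern, the toy Python sketch below shows that the N50 can rise while the number of contigs also rises. The contig lengths are invented for illustration and are not taken from the attached reports, so this is not an analysis of these particular results.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Invented lengths: both assemblies total 100 bases.
draft = [40, 30, 20, 10]          # 4 contigs
unzipped = [60, 35, 2, 1, 1, 1]   # 6 contigs, two long ones and several tiny ones

print(len(draft), n50(draft))        # 4 30
print(len(unzipped), n50(unzipped))  # 6 60
```

In this toy case the second assembly has more contigs but a higher N50, because most of the bases sit in a few long contigs.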

ykkim0127 commented 2 years ago

Dear RolandFaure,

Thanks for replying. I have attached a few files for my additional questions. I would much appreciate it if you could examine those results. Thanks for your help.

Best regards, Yu-kyung Kim


RolandFaure commented 2 years ago

Hi Yu-kyung, I'd love to take a look at your data. Because the email was sent via GitHub, I could not see the attached files. Could you send them directly to roland.faure@irisa.fr?

RolandFaure commented 2 years ago

About the weird lines: your message made me realize that this is normal behavior. GraphUnzip uses pickle.dump to write these files, so they are not human-readable. This should not concern you. For the linked reads, I cannot answer until I have seen the file. However, I do not think linked reads will be very useful if you already have long reads + Hi-C. As for the result of GraphUnzip, I will have a look at it. What I will do (and you can too) is re-run GraphUnzip with the options -r and --dont_merge and visualize the resulting GFA in Bandage. This will give you a more precise idea of what GraphUnzip did.
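To illustrate the point about pickle, here is a minimal sketch using only the Python standard library. The bytes `<80>^D` at the start of the dump above are 0x80 0x04, the header that pickle writes for protocol 4. The helper names `looks_like_pickle` and `load_pickled` are hypothetical and not part of GraphUnzip, and the toy dictionary stands in for the real matrix. Loading an actual GraphUnzip matrix this way would additionally require scipy to be installed, since the dump references scipy.sparse. Only unpickle files you produced yourself, because pickle.load can execute arbitrary code.

```python
import os
import pickle
import tempfile


def looks_like_pickle(path):
    """True if the file starts with a pickle protocol 2-5 header (0x80, protocol)."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    return len(head) == 2 and head[0] == 0x80 and 2 <= head[1] <= 5


def load_pickled(path):
    """Load a pickled object; use only on files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy stand-in for an interaction matrix: {(contig_i, contig_j): count}.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_matrix.txt")
    with open(path, "wb") as handle:
        pickle.dump({(0, 1): 3.0}, handle, protocol=4)
    is_pickle = looks_like_pickle(path)
    restored = load_pickled(path)

print(is_pickle, restored)  # True {(0, 1): 3.0}
```

A file that passes the header check is a binary pickle and will look like noise in a text editor despite its .txt name.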

RolandFaure commented 2 years ago

Hi @ykkim0127, I've looked carefully at your results. GraphUnzip worked fine. Here is an explanation of your results:

In conclusion, the assembly you get as an output of GraphUnzip is an improved assembly with higher contiguity than the original assembly and no gaps. I hope I have been clear, don't hesitate to reply if there are still some points that remain unclear.

Roland