mobidic / knotAnnotSV

A simple script to create a customizable html file from an AnnotSV output.
GNU General Public License v3.0

How to show the color gradient for the 'gene_name' column, and how to open the large HTML file in a browser? #21

Closed lpsyy closed 5 months ago

lpsyy commented 5 months ago

I would like to thank the author for developing this tool for AnnotSV-annotated SV (structural variant) files. It has made filtering and analysis convenient for me. Recently, I started using it on family WGS (whole genome sequencing) data to identify SVs. The annotated.tsv file ended up being 265 MB. After running knotAnnotSV, the generated HTML file reached 700 MB, which is too large to open in a browser. Is this a normal size? When I tried to visualize the data using a table-based tool, the color gradient for the 'gene_name' column did not appear. How can I resolve this issue? I look forward to your response!

[screenshot: SCR-20240602-taah]

lpsyy commented 5 months ago

My knotAnnotSV version is 1.1.

lpsyy commented 5 months ago

I just updated to the newest version. The output file still has no full/split button (1/2), and gene_name still has no color gradient. I tested the example, which worked. Could you give me some hints on how to fix it? This is my command:

perl ./knotAnnotSV2XL.pl --configFile ./config_AnnotSV.yaml --annotSVfile /Volumes/lps_SSD/knotannotsv/SV_ANANLYSIS/F15.SV-CNV.annotated.tsv --outDir ./example/ --genomeBuild hg38

While the program was running, messages like this were shown:

0_1_0_0.0000_0__1_17868619_17868638_TRA_1 Ignoring URL 'http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:17868619-17868638&hgt.out1=submit&highlight=hg38.chr1:17868619-17868638' since it exceeds Excel's limit of 65,530 URLs per worksheet. See LIMITATIONS section of the Excel::Writer::XLSX documentation. at ./knotAnnotSV2XL.pl line 1286.

Please help me! Thanks!

lgmgeo commented 5 months ago

Hi @lpsyy,

I'll let Thomas help you regarding the color gradient.

My advice for your family WGS analysis is to reduce the number of lines in the AnnotSV output with the following options:

lpsyy commented 5 months ago

Thank you for your reply. I tried copying some lines from the tsv file and ran the program on the new tsv, but the output xlsm still doesn't show the color gradient. It's also weird that ACMG_CLASS shows NA everywhere, although the ACMG_CLASS values are actually present. Could you give me some ideas about these problems?

[screenshot: Snipaste_2024-06-03_11-49-49]

lpsyy commented 5 months ago

The picture above shows the 1/2 button, but all the info has disappeared. To build the test file, I opened annotated.tsv in Excel, extracted some lines into a test xlsx, converted that xlsx to tsv with an online app, and then ran the program.

lgmgeo commented 5 months ago

I think you did something wrong in your copy. Windows and Unix have different end-of-line characters. You need to keep a Unix format. Perhaps this could explain the problem.
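A minimal sketch of how one might check for this before running knotAnnotSV. The helper names `count_crlf_lines` and `to_unix_eol` are hypothetical (they are not part of knotAnnotSV or AnnotSV), and the demo file is made up. The idea is that a file saved on Windows ends each line with a carriage return plus newline, so the last field of every row carries an invisible trailing `\r`, which might be enough to stop a value from being recognized.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of knotAnnotSV or AnnotSV.

# Count the lines of a file that end with a carriage return (Windows CRLF).
count_crlf_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

# Write a copy of $1 to $2 with the trailing carriage return removed from each line.
to_unix_eol() {
    awk '{ sub(/\r$/, ""); print }' "$1" > "$2"
}

# Demo on a made-up two-line file with Windows line endings.
demo_dir=$(mktemp -d)
printf 'AnnotSV_ID\tACMG_class\r\nsv1\t5\r\n' > "$demo_dir/windows.tsv"

count_crlf_lines "$demo_dir/windows.tsv"    # prints 2
to_unix_eol "$demo_dir/windows.tsv" "$demo_dir/unix.tsv"
count_crlf_lines "$demo_dir/unix.tsv"       # prints 0
```

If the first count is non-zero for a real file, converting it (or regenerating it on Linux) before running the script is probably worth trying.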

lpsyy commented 5 months ago

Thanks! I'll use Linux to split the large tsv and then try running the program. But I still need your help with the missing color gradient and the missing 1/2 button in the output file. Thanks for your timely reply!

lgmgeo commented 5 months ago

@thomasguignard (the knot developer) can better help you with this.

lpsyy commented 5 months ago

Could you help me contact @thomasguignard? Thank you very much!

thomasguignard commented 5 months ago

Hello @lpsyy,

I can suggest some ways to troubleshoot.

Concerning gene color, it is based on the LOEUF_bin annotation. Are you sure these values are present in your tsv? Do you use the website or a local installation to run AnnotSV? Do you get gene colors with a smaller input, such as the example file?
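One way to answer the first question from a shell, as a sketch only: the helpers `column_index` and `count_filled` are hypothetical names, the demo data is made up, and the code assumes a tab-separated file whose first line is the header containing a column named exactly `LOEUF_bin`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of knotAnnotSV or AnnotSV.

# Print the 1-based index of column $2 in the header of tsv $1, or 0 if absent.
column_index() {
    awk -F'\t' -v name="$2" '
        NR == 1 {
            sub(/\r$/, "")                 # tolerate a Windows line ending
            idx = 0
            for (i = 1; i <= NF; i++) if ($i == name) idx = i
            print idx
            exit
        }
    ' "$1"
}

# Count the data rows of tsv $1 whose column $2 is not empty.
count_filled() {
    awk -F'\t' -v name="$2" '
        { sub(/\r$/, "") }
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col && $col != "" { n++ }
        END { print n + 0 }
    ' "$1"
}

# Demo on made-up data: three rows, one of them without a LOEUF_bin value.
demo_dir=$(mktemp -d)
printf 'AnnotSV_ID\tGene_name\tLOEUF_bin\nsv1\tGENE_A\t0\nsv2\tGENE_B\t\nsv3\tGENE_C\t7\n' \
    > "$demo_dir/demo.tsv"

column_index "$demo_dir/demo.tsv" LOEUF_bin    # prints 3
count_filled "$demo_dir/demo.tsv" LOEUF_bin    # prints 2
```

An index of 0, or a filled count of 0, would suggest the annotation is missing from the input rather than lost during conversion.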

Concerning Excel's limit of 65,530 URLs per worksheet, I haven't found a workaround so far. It may impact gene color as well. It seems you have many SVs in your input. How many SVs do you have? If you have more than 65,530 SVs, you must filter some of them out.

If some of your SVs are HUGE, like an inversion of the whole of chromosome 1, you will bring all the genes of chromosome 1 into your input. So first, maybe try to filter those out.

When you first ran AnnotSV, did you try -rankFiltering (e.g. "3,4,5" or "3-5"; default = "1-5,NA"), as @lgmgeo suggested? You will get a cleaner output.

On the knotAnnotSV command line, you can use the --geneCountThreshold parameter, followed by a number of genes, to reduce the number of lines.

Do you absolutely need "split" lines for gene content? You can keep only "full" lines with the AnnotSV command-line option -annotationMode full.
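To get a feel for how far an input is from the limit mentioned above, a rough sketch follows. The helper names `count_rows` and `tally_column` are hypothetical, the demo data is made up, and the column name `Annotation_mode` is an assumption about the AnnotSV header that should be checked against the real file. The comparison with 65,530 also assumes roughly one hyperlink per output row, which the thread suggests but does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of knotAnnotSV or AnnotSV.

# Count the data rows of tsv $1 (every line after the header).
count_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Tally the data rows of tsv $1 per value of column $2, as "value<TAB>count".
tally_column() {
    awk -F'\t' -v name="$2" '
        { sub(/\r$/, "") }
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col { n[$col]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}

# Demo on made-up data: two SVs, one of them with two split lines.
demo_dir=$(mktemp -d)
printf 'AnnotSV_ID\tAnnotation_mode\nsv1\tfull\nsv1\tsplit\nsv1\tsplit\nsv2\tfull\n' \
    > "$demo_dir/demo.tsv"

rows=$(count_rows "$demo_dir/demo.tsv")
echo "data rows: $rows"
tally_column "$demo_dir/demo.tsv" Annotation_mode

# Assumed rule of thumb: about one URL per row in the Excel output.
if [ "$rows" -gt 65530 ]; then
    echo "more rows than Excel's URL limit: filter before running knotAnnotSV2XL"
fi
```

The tally also shows at a glance whether a file contains any "split" lines at all.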

lpsyy commented 5 months ago

Hi @thomasguignard, thank you for your reply! I split annotated.tsv into files of 1500 lines each, but the xlsx results still don't show the color gradient, and there is no 1/2 button at the top left. Running knotAnnotSV.pl works fine, though. The annotated.tsv was produced by a local AnnotSV installation.

About "Excel's limit of 65,530 URLs per worksheet": after I split the original annotated.tsv, the error no longer appeared. I'll look through the tsv results to filter out the large inversions or deletions, as you advised, and I'll run -rankFiltering (e.g. "3,4,5" or "3-5") to reduce the output.

Can the --geneCountThreshold parameter be used like this: "--geneCountThreshold 40"? (I noticed that the README says the limit is 40.) All my results are "full" lines, so is this the reason why there is no 1/2 button at the top left of the output xlsx? Looking forward to your response!
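The thread does not say how the split was done. As a sketch under that caveat, the hypothetical helper `split_with_header` below cuts a tsv into chunks of a fixed number of data rows and repeats the header line at the top of every chunk, on the assumption that each chunk has to be a complete AnnotSV-style table to be processed on its own. The demo data is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of knotAnnotSV or AnnotSV.

# Split tsv $1 into chunks of $2 data rows each, named <prefix>001.tsv,
# <prefix>002.tsv, ... where $3 is the prefix. The header is repeated in each chunk.
split_with_header() {
    awk -v size="$2" -v prefix="$3" '
        NR == 1 { header = $0; next }
        (NR - 2) % size == 0 {
            if (out != "") close(out)      # avoid keeping many files open
            out = sprintf("%s%03d.tsv", prefix, ++chunk)
            print header > out
        }
        { print > out }
    ' "$1"
}

# Demo on made-up data: 5 data rows in chunks of 2 give 3 files.
demo_dir=$(mktemp -d)
printf 'AnnotSV_ID\tSV_type\nsv1\tDEL\nsv2\tDUP\nsv3\tINV\nsv4\tDEL\nsv5\tTRA\n' \
    > "$demo_dir/annotated.tsv"

split_with_header "$demo_dir/annotated.tsv" 2 "$demo_dir/chunk_"
ls "$demo_dir"
```

Splitting purely by row count can separate a "full" line from its "split" lines, so for inputs that contain both, a chunk boundary may need adjusting by hand.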

lpsyy commented 5 months ago

These are screenshots of the results from running my split tsv:

[screenshots: Snipaste_2024-06-03_16-47-21, Snipaste_2024-06-03_16-47-37]

thomasguignard commented 5 months ago

Yes, in "full" only mode, there is no 1/2 button. The --geneCountThreshold parameter is only applied to "both" mode. This is designed to limit the number of split lines.

lpsyy commented 5 months ago

So if my data only has "full" lines, the --geneCountThreshold parameter has no effect? By the way, I also ran into trouble with "ACMG_CLASS NA". I opened the original annotated.tsv in Excel, extracted the rows with acmg_class 4-5, and saved them as a new file. Then I used Notepad on Windows to convert the new file to tsv format and ran knotAnnotSV. The acmg_class in the output file is NA. Have you ever had this problem? How can I avoid it? Thanks!

lpsyy commented 5 months ago

These are the partial results:

[screenshot: Snipaste_2024-06-03_18-41-20]

lpsyy commented 5 months ago

Is the NA that appears in acmg_class related to the information shown in the following screenshot?

[screenshot: Snipaste_2024-06-03_19-31-34]

thomasguignard commented 5 months ago

In a Linux shell, please try this command on your AnnotSV tsv file (assuming that the ACMG class column is the last one):

awk 'NR==1 || $NF==5 || $NF==4{print}' annotated.tsv > annotated_4-5.tsv

Then run knotAnnotSV on annotated_4-5.tsv and see if NA are still there.

I confirm that with "full" lines only, the --geneCountThreshold parameter has no effect.
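The awk one-liner above relies on the ACMG class being the last column. A variant that looks the column up by its header name is sketched below. The helper `filter_by_class` is a hypothetical name, the demo data is made up, and `ACMG_class` is an assumed spelling of the header that should be checked with `head -n 1` on the real file.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of knotAnnotSV or AnnotSV.

# Print the header of tsv $1 plus the rows whose column $2 equals one of the
# comma-separated values in $3. Trailing carriage returns are stripped.
filter_by_class() {
    awk -F'\t' -v name="$2" -v keep="$3" '
        BEGIN { n = split(keep, k, ","); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        { sub(/\r$/, "") }
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; print; next }
        col && ($col in want)
    ' "$1"
}

# Demo on made-up data where the class is not the last column.
demo_dir=$(mktemp -d)
printf 'AnnotSV_ID\tACMG_class\tComment\nsv1\t5\tsome text\nsv2\t3\t\nsv3\t4\t\nsv4\tNA\t\n' \
    > "$demo_dir/annotated.tsv"

filter_by_class "$demo_dir/annotated.tsv" ACMG_class 4,5 > "$demo_dir/annotated_4-5.tsv"
cat "$demo_dir/annotated_4-5.tsv"    # header, then the sv1 and sv3 rows
```

If the named column is not found, only the header is printed, which makes a misspelled column name easy to spot.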

lpsyy commented 5 months ago

Excellent! The problem has been solved! Could you tell me what you think caused the problems I ran into? Thanks!

thomasguignard commented 5 months ago

Not 100% sure, but as a way of life, I avoid handling files back and forth between Windows and Linux as much as possible. I hope this could help.

lpsyy commented 5 months ago

I understand! Finally, I would like to express my sincere gratitude for your diligent and patient explanation, which has effectively resolved my issue. Many thanks! I wish you all the best!

lpsyy commented 5 months ago

Hi @thomasguignard, I have another question. I noticed that the example you supplied has two samples in annotated.tsv, but after running the program the sample info was gone. The same thing happened with my data (4 people in the family). How can I know which sample each SV comes from?

lgmgeo commented 5 months ago

The SV comes from the samples reported in the Samples_ID column:

[image]

Moreover, if present in the VCF, the GT is reported:

[image]

cf. the AnnotSV README:

[image]

lpsyy commented 5 months ago

Hi @lgmgeo, thanks! But I looked at the examples provided with knotAnnotSV, and the Samples_ID info does not exist in the xlsx and html files. I also checked the config file and found no Samples_ID entry, and the usage instructions do not mention this setup step either. Could you tell me more about this?

[screenshots: Snipaste_2024-06-04_15-00-38, Snipaste_2024-06-04_15-02-15]

These two pictures show the header of the xlsx for the example file "AnnotSV_3.4.tsv".

lgmgeo commented 5 months ago

I think you just need to add "Samples_ID" to config_AnnotSV.yaml:

[image]

lpsyy commented 5 months ago

Thank you so much! The Samples_ID info now shows up 👍