tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data
174 stars 39 forks source link

classify.pl error #35

Closed MichelMoser closed 4 years ago

MichelMoser commented 4 years ago

Hi,

I try to create allele.ctg.table files but run into trouble running the classify step.

classify.pl returns following errors:

Use of uninitialized value $tgene in substitution (s///) at /mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 54, <IN> line 14417.
Use of uninitialized value $tgene in hash element at mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 55, <IN> line 14417.
Use of uninitialized value $tgene in substitution (s///) at mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 54, <IN> line 14418.
Use of uninitialized value $tgene in hash element at /mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 55, <IN> line 14418.

I thought its formatting problem but cant spot anything obvious in my input files.

I formated cds files according to tutorials with identical name in gff: target gff3:

Flye15k_scaffold_584    Ssal_COMBO_polished     gene    640018  642089  .       -       .       ID=ENSSSAT00000004279_1_path1;Name=ENSSSAT00000004279.1

target cds

>ENSSSAT00000004279_1_path1

ref gff3

ssa29   ensembl gene    42450663        42460615        .       -       .       ID=gene_ENSSSAG00000067226;biotype=protein_coding;gene_id=ENSSSAG00000067226;logic_name=ensembl;version=1

ref cds

>gene_ENSSSAG00000067226

after blastn_parse.pl Eblast.out

gene_ENSSSAG00000004085 ENSSSAT00000008969_1_path1      97.113  381     11      0       1       381     1       381     0.0     643
gene_ENSSSAG00000004089 ENSSSAT00000075924_1_path1      99.383  324     2       0       1       324     1       324     4.68e-167       588

Is there a problem using underscores as gene names? thanks, Michel

tangerzhang commented 4 years ago

Hi Michel, Sorry for the late response. This is because the script reads gene names from Name=xxx rather than ID=xxx. I noticed that the gene names in target cds are exactly the same with gene names from ID=xxx. Therefore, you may need to modify the codes in Line 53 and 65 as below:

my $tgene    = $1 if(/ID=(\S+)/);
my $rgene = $1 if(/ID=(\S+)/);
MichelMoser commented 4 years ago

sorry, my bad. I assumed Name == ID, which is False of course =) I formatted the gff and changed "Name=" accordingly and it works now.

The output file Allele.ctg.table shows a lot of lines where the contig names are the same. Does that mean that ALLHIC will remove Hi-C contacts within contigs as well? I hope not! Or should i parse such lines out?

ssa28   36674475        Flye15k_contig_313      Flye15k_contig_313
sssa28   36674475        Flye15k_contig_313      Flye15k_contig_313
ssa28   36819276        Flye15k_contig_850      Ssal_flye30K_contig_01029
ssa28   36879032        Ssal_flye30K_contig_00195       Ssal_flye30K_contig_00195

Thanks, Michel

tangerzhang commented 4 years ago

ALLHiC will not remove Hi-C contacts within contigs. Therefore, no worry about the lines that share the same contigs.