Open patrick-schultz opened 1 year ago
data.tar.gz The files from the user.
This is fine: x.txt
In [11]: import hail as hl
...: gwas = hl.read_table("gwas_filtered.ht")
...: loci_to_gene = hl.import_table("x.txt",impute=True)
...: locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
...: loci_to_gene = loci_to_gene.annotate(locus=locus)
...: loci_to_gene = loci_to_gene.key_by("locus")
...: loci_to_gene = loci_to_gene.select("gene")
...: loci_to_gene = loci_to_gene.filter(loci_to_gene.locus.position == 51749536)
...: loci_to_gene = loci_to_gene.checkpoint('/tmp/foo.ht', overwrite=True)
...: print(gwas.collect())
...: print(loci_to_gene.collect())
...: gwas.annotate(gene=loci_to_gene[gwas.locus].gene).collect()
2023-08-01 10:54:33.166 Hail: INFO: Reading table to impute column types
2023-08-01 10:54:33.864 Hail: INFO: Finished type imputation
Loading field '' as type int32 (imputed)
Loading field 'chromosome' as type str (imputed)
Loading field 'locus' as type int32 (imputed)
Loading field 'gene' as type str (imputed)
2023-08-01 10:54:34.137 Hail: INFO: Coerced sorted dataset
2023-08-01 10:54:35.270 Hail: INFO: wrote table with 1 row in 1 partition to /tmp/foo.ht
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'])]
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), gene='PXDNL')]
Out[11]: [Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]
This gives an NA gene:
chromosome locus gene
0 chr1 6109351 RNF207
1 chr1 6110293 RNF207
2 chr1 6110639 RNF207
3 chr1 6110841 RNF207
119111 chr8 51749536 PXDNL
Any one of those chr1 rows seem to trigger the issue.
Removing them eliminates the issue.
Changing the chromosome to 2 3 or 4 preserves the error. Chromosome 5 eliminates the error.
fails:
chromosome locus gene
0 chr4 6109351 RNF207
119111 chr8 51749536 PXDNL
succeeds
chromosome locus gene
0 chr5 1 RNF207
119111 chr8 51749536 PXDNL
Test code:
import hail as hl
gwas = hl.read_table("gwas_filtered.ht")
loci_to_gene = hl.import_table("x4.tsv",impute=True)
locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
loci_to_gene = loci_to_gene.annotate(locus=locus)
loci_to_gene = loci_to_gene.key_by("locus")
print(gwas.annotate(gene=loci_to_gene[gwas.locus].gene).collect())
Gene is missing:
chromosome locus gene
0 chr4 99498697 RNF207
119111 chr8 51749536 PXDNL
Gene is not missing:
chromosome locus gene
0 chr4 99498698 RNF207
119111 chr8 51749536 PXDNL
In [211]: import hail as hl
...: y = hl.read_table("chr8_51749536.ht")
...: x = hl.read_table('ch5_and_ch8.ht')
...: x = x.key_by(locus=hl.locus(x.locus.contig, hl.if_else(x.locus.contig == "chr4", x.
...: locus.position + 1, x.locus.position), reference_genome='GRCh38'))
...: print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:44:31.086 Hail: INFO: Coerced sorted dataset
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]
In [212]: import hail as hl
...: y = hl.read_table("chr8_51749536.ht")
...: x = hl.read_table('ch5_and_ch8.ht')
...: x = x.key_by(locus=hl.locus(x.locus.contig, hl.if_else(x.locus.contig == "chr4", x.
...: locus.position, x.locus.position), reference_genome='GRCh38'))
...: print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:44:37.843 Hail: INFO: Coerced sorted dataset
2023-08-01 11:44:38.519 Hail: INFO: wrote table with 2 rows in 1 partition to /tmp/__iruid_427336-Fs8dy7BWi7NrZic7JIAPzg
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene=None)]
This does not replicate:
In [229]: import hail as hl
...: y = hl.Table.parallelize([
...: hl.Struct(locus=hl.Locus(contig='chr8', position=51749536, reference_genome='GRCh38'), alleles=['G', 'T'])
...: ], key='locus')
...: x = hl.Table.parallelize([
...: hl.Struct(**{'': 0, 'chromosome': 'chr4', 'locus': hl.Locus(contig='chr4', position=99498697, reference_genome='GRCh38'),
...: 'gene': 'RNF207'}),
...: hl.Struct(**{'': 119111, 'chromosome': 'chr8', 'locus': hl.Locus(contig='chr8', position=51749536, reference_genome='GRCh38'),
...: 'gene': 'PXDNL'})
...: ], key='locus')
...: print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:50:29.558 Hail: INFO: Coerced sorted dataset
2023-08-01 11:50:29.993 Hail: INFO: Coerced sorted dataset
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]
https://github.com/hail-is/hail/files/12231481/data2.tar.gz
Simplest repro I have. One row in chr8... two rows in ch5...
In [2]: import hail as hl
...: y = hl.read_table("chr8_51749536.ht")
...: x = hl.read_table('ch5_and_ch8.ht')
...: print(y.annotate(gene=x[y.locus].gene).collect())
TableIR:
(TableCollect
(TableOrderBy (Alocus Aalleles)
(TableMapRows
(TableLeftJoinRightDistinct __uid_3
(TableRead
Table{global:Struct{},key:[locus,alleles],row:Struct{locus:Locus(GRCh38),alleles:Array[String]}}
False (TableNativeReader chr8_51749536.ht ))
(TableRead
Table{global:Struct{},key:[locus],row:Struct{locus:Locus(GRCh38),gene:String}}
False (TableNativeReader ch5_and_ch8.ht )))
(InsertFields
(SelectFields (locus alleles) (Ref row))
None
(gene
(GetField gene (GetField __uid_3 (Ref row))))))))
This still replicates as of 0.2.124-3189854eb0a7.
What happened?
Bug reported in https://discuss.hail.is/t/table-index-returning-none/3507/4
Using files provided by user:
gwas
contains only one row.loci_to_gene = loci_to_gene.filter(loci_to_gene.locus == l)
constructsloci_to_gene
to contain only one row with the same locus. Yet joining on locus behaves as if the loci are not equal.Version
0.2.120
Relevant log output
No response