hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License
982 stars 246 forks source link

[query] error when annotating a matrix table with a table (joins) #13339

Open patrick-schultz opened 1 year ago

patrick-schultz commented 1 year ago

What happened?

Bug reported in https://discuss.hail.is/t/table-index-returning-none/3507/4

Using files provided by user:

> gwas = hl.read_table("gwas_filtered.ht")
> loci_to_gene = hl.import_table("loci_to_gene.tsv",impute=True)

> gwas.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
> gwas.show()
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh38> | array<str> |
+---------------+------------+
| chr8:51749536 | ["G","T"]  |
+---------------+------------+

> locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
> loci_to_gene = loci_to_gene.annotate(locus=locus)
> loci_to_gene = loci_to_gene.key_by("locus")
> loci_to_gene = loci_to_gene.select("gene")

> l = gwas.locus.collect()[0]
> loci_to_gene = loci_to_gene.filter(loci_to_gene.locus == l)
> loci_to_gene.show()
+---------------+---------+
| locus         | gene    |
+---------------+---------+
| locus<GRCh38> | str     |
+---------------+---------+
| chr8:51749536 | "PXDNL" |
+---------------+---------+

> gwas.annotate(gene=loci_to_gene[gwas.locus].gene).show()
+---------------+------------+------+
| locus         | alleles    | gene |
+---------------+------------+------+
| locus<GRCh38> | array<str> | str  |
+---------------+------------+------+
| chr8:51749536 | ["G","T"]  | NA   |
+---------------+------------+------+

gwas contains only one row. loci_to_gene = loci_to_gene.filter(loci_to_gene.locus == l) constructs loci_to_gene to contain only one row with the same locus. Yet joining on locus behaves as if the loci are not equal.

Version

0.2.120

Relevant log output

No response

danking commented 1 year ago

data.tar.gz The files from the user.

danking commented 1 year ago

This is fine: x.txt

In [11]: import hail as hl
    ...: gwas = hl.read_table("gwas_filtered.ht")
    ...: loci_to_gene = hl.import_table("x.txt",impute=True)
    ...: locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
    ...: loci_to_gene = loci_to_gene.annotate(locus=locus)
    ...: loci_to_gene = loci_to_gene.key_by("locus")
    ...: loci_to_gene = loci_to_gene.select("gene")
    ...: loci_to_gene = loci_to_gene.filter(loci_to_gene.locus.position == 51749536)
    ...: loci_to_gene = loci_to_gene.checkpoint('/tmp/foo.ht', overwrite=True)
    ...: print(gwas.collect())
    ...: print(loci_to_gene.collect())
    ...: gwas.annotate(gene=loci_to_gene[gwas.locus].gene).collect()
2023-08-01 10:54:33.166 Hail: INFO: Reading table to impute column types
2023-08-01 10:54:33.864 Hail: INFO: Finished type imputation
  Loading field '' as type int32 (imputed)
  Loading field 'chromosome' as type str (imputed)
  Loading field 'locus' as type int32 (imputed)
  Loading field 'gene' as type str (imputed)
2023-08-01 10:54:34.137 Hail: INFO: Coerced sorted dataset
2023-08-01 10:54:35.270 Hail: INFO: wrote table with 1 row in 1 partition to /tmp/foo.ht
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'])]
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), gene='PXDNL')]
Out[11]: [Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]
danking commented 1 year ago

This gives an NA gene:

    chromosome  locus   gene
0   chr1    6109351 RNF207
1   chr1    6110293 RNF207
2   chr1    6110639 RNF207
3   chr1    6110841 RNF207
119111  chr8    51749536    PXDNL
danking commented 1 year ago

Any one of those chr1 rows seem to trigger the issue.

danking commented 1 year ago

Removing them eliminates the issue.

danking commented 1 year ago

Changing the chromosome to 2 3 or 4 preserves the error. Chromosome 5 eliminates the error.

fails:

    chromosome  locus   gene
0   chr4    6109351 RNF207
119111  chr8    51749536    PXDNL

succeeds

    chromosome  locus   gene
0   chr5    1   RNF207
119111  chr8    51749536    PXDNL
danking commented 1 year ago

Test code:

import hail as hl
gwas = hl.read_table("gwas_filtered.ht")
loci_to_gene = hl.import_table("x4.tsv",impute=True)
locus = hl.locus(loci_to_gene.chromosome, loci_to_gene.locus, "GRCh38")
loci_to_gene = loci_to_gene.annotate(locus=locus)
loci_to_gene = loci_to_gene.key_by("locus")
print(gwas.annotate(gene=loci_to_gene[gwas.locus].gene).collect())

Gene is missing:

    chromosome  locus   gene
0   chr4    99498697    RNF207
119111  chr8    51749536    PXDNL

Gene is not missing:

    chromosome  locus   gene
0   chr4    99498698    RNF207
119111  chr8    51749536    PXDNL
danking commented 1 year ago

data2.tar.gz

In [211]:     import hail as hl
     ...:     y = hl.read_table("chr8_51749536.ht")
     ...:     x = hl.read_table('ch5_and_ch8.ht')
     ...:     x = x.key_by(locus=hl.locus(x.locus.contig, hl.if_else(x.locus.contig == "chr4", x.
     ...: locus.position + 1, x.locus.position), reference_genome='GRCh38'))
     ...:     print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:44:31.086 Hail: INFO: Coerced sorted dataset
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]

In [212]:     import hail as hl
     ...:     y = hl.read_table("chr8_51749536.ht")
     ...:     x = hl.read_table('ch5_and_ch8.ht')
     ...:     x = x.key_by(locus=hl.locus(x.locus.contig, hl.if_else(x.locus.contig == "chr4", x.
     ...: locus.position, x.locus.position), reference_genome='GRCh38'))
     ...:     print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:44:37.843 Hail: INFO: Coerced sorted dataset
2023-08-01 11:44:38.519 Hail: INFO: wrote table with 2 rows in 1 partition to /tmp/__iruid_427336-Fs8dy7BWi7NrZic7JIAPzg
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene=None)]
danking commented 1 year ago

This does not replicate:

In [229]: import hail as hl
     ...: y = hl.Table.parallelize([
     ...:     hl.Struct(locus=hl.Locus(contig='chr8', position=51749536, reference_genome='GRCh38'), alleles=['G', 'T'])
     ...: ], key='locus')
     ...: x = hl.Table.parallelize([
     ...:     hl.Struct(**{'': 0, 'chromosome': 'chr4', 'locus': hl.Locus(contig='chr4', position=99498697, reference_genome='GRCh38'),
     ...:                  'gene': 'RNF207'}),
     ...:     hl.Struct(**{'': 119111, 'chromosome': 'chr8', 'locus': hl.Locus(contig='chr8', position=51749536, reference_genome='GRCh38'),
     ...:                  'gene': 'PXDNL'})
     ...: ], key='locus')
     ...: print(y.annotate(gene=x[y.locus].gene).collect())
2023-08-01 11:50:29.558 Hail: INFO: Coerced sorted dataset
2023-08-01 11:50:29.993 Hail: INFO: Coerced sorted dataset
[Struct(locus=Locus(contig=chr8, position=51749536, reference_genome=GRCh38), alleles=['G', 'T'], gene='PXDNL')]
danking commented 1 year ago

https://github.com/hail-is/hail/files/12231481/data2.tar.gz

Simplest repro I have. One row in chr8... two rows in ch5...

In [2]:     import hail as hl
   ...:     y = hl.read_table("chr8_51749536.ht")
   ...:     x = hl.read_table('ch5_and_ch8.ht')
   ...:     print(y.annotate(gene=x[y.locus].gene).collect())

TableIR:

(TableCollect
  (TableOrderBy (Alocus Aalleles)
    (TableMapRows
      (TableLeftJoinRightDistinct __uid_3
        (TableRead
          Table{global:Struct{},key:[locus,alleles],row:Struct{locus:Locus(GRCh38),alleles:Array[String]}}
          False (TableNativeReader chr8_51749536.ht ))
        (TableRead
          Table{global:Struct{},key:[locus],row:Struct{locus:Locus(GRCh38),gene:String}}
          False (TableNativeReader ch5_and_ch8.ht )))
      (InsertFields
        (SelectFields (locus alleles) (Ref row))
        None
        (gene
          (GetField gene (GetField __uid_3 (Ref row))))))))

hail-20230801-1154-0.2.120-be655bbda3cb.log

danking commented 1 year ago

This still replicates as of 0.2.124-3189854eb0a7.