opencb / hpg-bigdata

This repository implements converters and tools for working with NGS data in HPC or Hadoop clusters
Apache License 2.0

REST SparkSQL query: returned variants are duplicated by the number of explodes #60

Closed dapregi closed 8 years ago

dapregi commented 8 years ago

The SparkSQL query obtained from the REST service returns duplicated variants. This seems to happen because of the explodes in the SQL query.

In this case there should be just one returned variant with id 'rs587604674'.
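
For context, here is a minimal sketch of the kind of Spark SQL query that can produce this behaviour. The exact query built by the REST service is not shown in this issue; the Parquet path, the view name `vcf` and the field `annotation.conservation` below are assumptions for illustration only:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplodeDuplicationExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explode-duplication")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical Parquet file with the variant dataset; the real REST
        // service loads its own data.
        Dataset<Row> variants = spark.read().parquet("/path/to/variants.parquet");
        variants.createOrReplaceTempView("vcf");

        // LATERAL VIEW + explode over the (assumed) annotation.conservation
        // array adds a 'cons' column and emits one output row per array
        // element, so the same variant id can come back more than once.
        Dataset<Row> result = spark.sql(
                "SELECT * FROM vcf "
                + "LATERAL VIEW explode(annotation.conservation) t AS cons "
                + "WHERE id = 'rs587604674'");
        result.show();
    }
}
```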

jtarraga commented 8 years ago

Using LATERAL VIEW and the explode function (to query inside array/map structures) has two effects:

1) Additional columns are created in the original dataset (one column per view), e.g. the column 'cons' in the previous query:

```
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|         id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length|type| hgvs|             studies|          annotation|                cons|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
```

2) The output dataset can contain 'duplicated' rows, as mentioned. The content of the 'original' columns is identical in these 'duplicated' rows, but they differ in the 'additional' columns, in our example 'cons':

```
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|         id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length|type| hgvs|             studies|          annotation|                cons|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
|rs587604674|   []|        22|16064870|16064870|        C|        T|     +|null|     1| SNP|Map()|[[hgva@hsapiens_g...|[22,16064870,C,T,...|[0.38699999451637...|
|rs587604674|   []|        22|16064870|16064870|        C|        T|     +|null|     1| SNP|Map()|[[hgva@hsapiens_g...|[22,16064870,C,T,...|[0.10199999809265...|
+-----------+-----+----------+--------+--------+---------+---------+------+----+------+----+-----+--------------------+--------------------+--------------------+
```

Content of the additional 'cons' column for each row:

```
[[0.3869999945163727,phastCons,]]
[[0.10199999809265137,phylop,]]
```
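
These values can be inspected without truncation with something like the following, assuming the query result is held in a `Dataset<Row>` named `result`:

```java
// Show the full content of the exploded 'cons' column next to the variant id
result.select("id", "cons").show(false);
```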

It can be fixed by post-processing the output dataset:

1) Removing the additional columns using the function drop: `dataset.drop(column_name)`
2) Removing the 'duplicated' rows using the function dropDuplicates on a column with unique values (in variant datasets, the column "id"): `dataset.dropDuplicates("id")`
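
A minimal sketch of that post-processing, assuming the query result is held in a `Dataset<Row>` named `result` and that the column added by the LATERAL VIEW is called `cons`:

```java
// 1) Drop the column added by the explode;
// 2) keep a single row per variant id.
Dataset<Row> cleaned = result
        .drop("cons")
        .dropDuplicates("id");
cleaned.show();
```

Since the 'original' columns are identical across the duplicated rows, it does not matter which of them dropDuplicates keeps.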