[DM-17872] Fix column names

rubin-dp0 / tutorial-notebooks

Tutorial Jupyter Notebooks for Data Preview 0, created and maintained by the Rubin Observatory Community Science Team.

Apache License 2.0

34 stars 17 forks source link

[DM-17872] Fix column names #174

Closed cbanek closed 8 months ago

cbanek commented 8 months ago

The returned column names for functions and other things needed a number to get to be unique over the whole query. For example, if you are doing two max()'s you need to be able to tell them apart.

review-notebook-app[bot] commented 8 months ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

cbanek commented 8 months ago

Sorry, the logic in the TAP service which CADC took wanted each column to either be specifically named as a column, an AS, or have a number if it's a function. This also handles the math so if the column is like 1+column will be returned as col7.

The TAP service is doing what it is supposed to be doing. The notebook needs the fix. This fix is correct. We will not be changing the CADC TAP service code as it is very hard in the parser to determine what other columns are present.

This is how it is supposed to be and will be in the future.

gpdf commented 8 months ago

This seems inconsistent with the actual service behavior. Performing the query

SELECT min(ssObjectId), max(ssObjectId) FROM dp03_catalogs_10yr.MPCORB

yields the following result:

<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.4">
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="OK" />
    <INFO name="QUERY_TIMESTAMP" value="2024-02-23T21:32:26.096" />
    <INFO name="QUERY" value="SELECT min(ssObjectId), max(ssObjectId) FROM dp03_catalogs_10yr.MPCORB" />
    <TABLE>
      <FIELD name="min" datatype="long" />
      <FIELD name="max" datatype="long" />
      <DATA>
        <TABLEDATA>
          <TR>
            <TD>-9223370383071521539</TD>
            <TD>9223370875126069107</TD>
          </TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
    <INFO name="placeholder" value="ignore" />
  </RESOURCE>
</VOTABLE>

The names returned are min and max with no numeric suffixes.

cbanek commented 8 months ago

data.lsst.cloud has not been updated, try data-int

MelissaGraham commented 8 months ago

data.lsst.cloud has not been updated, try data-int

I didn't get to trying this today but it inspired a few questions:

What is the timescale for deploying changes at data.lsst.cloud? We shouldn't merge this to main (or release to prod) until that day.
By chance did you check all the tutorial notebooks for use of MIN() and MAX()? Or maybe, is DP03_01 the only one failing CI at data-int and so we can be sure it's the only NB in need of this update?

gpdf commented 8 months ago

try data-int

Sorry, of course. I do see the "min1" behavior there.

This behavior is also not on CADC ops (e.g., on https://ws.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/argus/ ). I gather this is from the PR you and Pat were working on, so we do expect it to appear there as well?

I see that @pdowler is on board with the consequences: https://github.com/opencadc/tap/pull/116#pullrequestreview-1382338416

I don't object to the change; it's just as Melissa said - we just have to document it clearly and change everything that needs changing in a timely way.

gpdf commented 8 months ago

2. By chance did you check all the tutorial notebooks for use of MIN() and MAX()? Or maybe, is DP03_01 the only one failing CI at data-int and so we can be sure it's the only NB in need of this update?

I do have a lot of trust in the CI, but I did also look for "max" everywhere on the repo, and it does look like this is the only use of it in ADQL. There are some numpy usages on the Python side, of course.

MelissaGraham commented 8 months ago

I did test it at data-int this morning and now I see how the column names created from using MIN() and MAX() statements in an ADQL query are min1 and max2. The cell does not fail at data-int.

Thanks Gregory for doing a search on the repo for other ADQL uses of MIN/MAX. I agree that CI would have caught any other fails caused by the change so this branch is probably good to go.

I approved the changes but we should not merge yet, not until this is deployed at data.lsst.cloud.

cbanek commented 8 months ago

Everything has been ready to roll out since last Friday. Let's please merge this ASAP, and then pull this to the prod branch, and then I can deploy to data.lsst.cloud within seconds. Do you want me to handle this merge or are you in charge of that one?