ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.34k stars 600 forks source link

feat: Support for Microsoft Fabric Warehouse #9494

Closed csubhodeep closed 3 months ago

csubhodeep commented 4 months ago

Is your feature request related to a problem?

NO.

What is the motivation behind your request?

xref #9454

Describe the solution you'd like

I would like to be able to interact with a typical Microsoft Fabric Warehouse, which has quite similar interfaces as any other Datawarehousing technologies that ibis already supports. It would be nice if we could continue using the ibis.mssql backend for it.

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

DuckDB & SQLServer

Code of Conduct

cpcloud commented 4 months ago

@csubhodeep Have you tried using the Ibis mssql backend for this? Maybe it just worksβ„’?

cpcloud commented 4 months ago

@djouallah Is this something that might interest you?

csubhodeep commented 4 months ago

@cpcloud yes I have tried the mssql backend and saw the error messages that i have mentioned in the discussion. It "kind of works".. but requires a bit of hacking.

csubhodeep commented 4 months ago

JFYI.. I think for most of the methods like list_tables, tables, list_databases, list_schemas, create_table could be supported quickly because, i think, these are just suffering from a mere case-sensitivity issue, from the error message that I had shared in the thread we get the following

ProgrammingError: ('42S02', "[42S02] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Invalid object name '<NAME OF WAREHOUSE>.information_schema.schemata'. (208) (SQLExecDirectW)")

In our data warehouse if I just execute the following

select *
from INFORMATION_SCHEMA.COLUMNS;

I get a nice table without any error. However, if I do the following (as in the error message)

select *
from information_schema.columns;

I get the exact same error on Fabric image

csubhodeep commented 4 months ago

@cpcloud @gforsyth Do we know by when could we solve this one?

Also none of the hack works if I try to create a table from an in-memory table made using using either ibis.memtable or pandas.DataFrame constructors.

It is now blocking us, hence, it would be nice to know if anyone could suggest any workarounds till this is officially supported.

csubhodeep commented 4 months ago

UPDATE

As per the their official docs

Currently, Latin1_General_100_BIN2_UTF8 is the default and only supported collation for both tables and metadata.

That means Fabric is currently case-sensitive presumably till the end of this quarter (if their development goes according to their plan). As per their roadmap they would support more collations later.

Now, from the perspective of ibis, at least from the error messages, if I understood them correctly, accessing the information_schema is crucial for some of the methods to work correctly (such as create_table, table, list_tables etc.)

Therefore, to circumvent the above, I wrote a small Python snippet that creates - a) a new schema with the name that ibis expects i.e. lower-cased information_schema, and b) views that the original schema i.e. (upper cased) INFORMATION_SCHEMA has but all of their names and the column names in them in lower case.

query_to_get_all_org_views_and_columns = """
SELECT 
    o.name AS view_name, 
    c.name AS column_name
FROM sys.all_columns c
INNER JOIN (
    SELECT *
    FROM sys.all_objects o
    WHERE schema_id = 3   -- schema_id = 3 is the schema_id for the information_schema
    ) o 
    ON c.object_id = o.object_id
"""

df_is = pd.read_sql_query(
    query_to_get_all_org_views_and_columns, 
    connection  # this is the same `pyodbc` connection created previously using custom authentication
)

cursor = connection.cursor()

for view_name, gdf in df_is.groupby("view_name"):
    select_columns = ", ".join([f"{col} AS {col.lower()}" for col in gdf["column_name"]])

    create_view_query = f"CREATE VIEW information_schema.{view_name.lower()} AS SELECT {select_columns} FROM INFORMATION_SCHEMA.{view_name};"
    cursor.execute(create_view_query)
    cursor.commit()

With the above I achieved the following image

And with that all API methods work normally. πŸ˜ƒ

Ofcourse, this is not at all a solution and rather a hack but I just wanted to share this in case someone else is looking for a temporary workaround.

deepyaman commented 4 months ago

Hi @csubhodeep! Thanks for digging into this further, and glad you were able to find a workaround. We just discussed this amongst the team just now, and we honestly don't have a much better solution at this time that wouldn't be equally hacky or have potential unintended side effects. Not sure if the support for case-insensitive collation will help here once released, as it sounds like it would affect EVERYTHING in the database (which could easily break existing logic).

At this point, the workaround might be the best path forward. Would also love to see if other people come across this issue and/or find your workaround helpful.

MarkPryceMaherMSFT commented 4 months ago

Full disclosure I work in the Fabric DW team at Microsoft, @csubhodeep is correct, this is due to the collation. You would see the same error if you tried this on a SQL Server in the same collation (Latin1_General_100_BIN2_UTF8). [TLRDR: the system tables/views are case sensitive too]

The good news, we are adding support for a case insensitive collation in Fabric DW, in Q3 this year. link

gforsyth commented 4 months ago

Hey all -- I'm going to take a look at doing this.

We built this backend against mcr.microsoft.com/mssql/server:2022-latest and it defaults (or all MSSQL servers default?) to SQL_Latin1_General_CP1_CI_AS.

Given that current users aren't having issues with that collation, if we switch our testing collation to be case-sensitive, it should (?) make things work with Fabric and other case-sensitive collations without breaking anyone using a case-insensitive collation.

Famous last words? I'll report back.

csubhodeep commented 4 months ago

@gforsyth with my limited understanding of DB internals (and Fabric), I would like to ask a stupid question, does that mean ibis.mssql backend would do multiple rounds of running all (DDL) queries one with different cases (like, for instance, if upper case fails try with lower case)?

gforsyth commented 4 months ago

@csubhodeep -- no, that would be an unacceptable trade-off.

What I mean is that if we update the mssql backend to test against a case-sensitive collation (and if it works...), then we'll correctly refer to, say, INFORMATION_SCHEMA.SCHEMATA and that will work on Fabric.

For users who were using a case-insensitive collation, it won't matter because SCHEMATA == schemata

csubhodeep commented 4 months ago

Ah ok I see, thanks for the explanations πŸ˜ƒ

gforsyth commented 3 months ago

@csubhodeep -- if you install from main, Ibis should work on Fabric now, since we are now testing against the same collation (Latin1_General_100_BIN2_UTF8)

csubhodeep commented 3 months ago

@csubhodeep -- if you install from main, Ibis should work on Fabric now, since we are now testing against the same collation (Latin1_General_100_BIN2_UTF8)

ok thanks a lot I would try it out at the earliest.

csubhodeep commented 3 months ago

Currently I am facing some issues while logging into our Fabric workspace using my corporate account. I would try and give a feedback as soon as it is resolved. Meanwhile, @MarkPryceMaherMSFT it would be really nice if you (or someone from your team) could test this out. Thanks ! πŸ˜ƒ

csubhodeep commented 3 months ago

Hello @gforsyth , I have tested this and most of the API methods listed in this page works.

Although kind of unrelated to this thread, I would suggest, for the read_* methods that have not been implemented yet, you could either mark them (with colors or symbols) or simply remove them till they are actually implemented.

Thanks a lot again for pushing this feature through. πŸ˜ƒ