seanharr11 / etlalchemy

Extract, Transform, Load: Any SQL Database in 4 lines of Code.
MIT License

Segmentation Fault #3

Closed EthanBannister closed 8 years ago

EthanBannister commented 8 years ago

I was able to use ETLAlchemy on Friday with no issues (today is Tuesday). It did everything that I hoped it would - and it was fast!

However, I tried to run the script again and I am now getting a segmentation fault. Nothing within the script/code has changed.

I made sure that all the dependencies were up to date and that all my system packages were also up to date. I even updated pip, not that it matters or anything.

When I use gdb to try and debug, the last library that gets called appears to be libgss.so.3:

```
Sending source '<etlalchemy.ETLAlchemySource.ETLAlchemySource instance at 0x7ffff4187ab8>' to destination 'mysql://user:password@localhost/Temp_For_ETL'

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff1868df5 in ?? () from /usr/lib/libgss.so.3
```

I essentially have the exact code that was used in the examples that were given. For my source, I am pulling a table from an MSSQL DB and then migrating it into a MySQL DB. I am using Microsoft's MSSQL ODBC driver (version 13 for Debian/Ubuntu), which appears to be working with no issues otherwise. GDB also indicates that all is well until the code hits the migrate function, where it then gives me the segmentation fault.

Any ideas as to what I could be missing?

seanharr11 commented 8 years ago

Hmm...I've never seen a segfault when running the tool (or in any Python package, really...), but it's probably a memory bug in a native library. It's pretty hard to cause a segfault in pure Python with garbage collection, so it's likely something outside the scope of etlalchemy, though I wouldn't rule it out. Can you post the full stack trace?

Sounds like a compatibility issue between your driver and one of the databases. It looks like libgss is a security library, and is probably used by the Microsoft ODBC driver to pass credentials along to SQL Server.

Personally, when dealing with SQL Server I use pyodbc and the FreeTDS driver, and have had success with several etlalchemy migrations, in various environments.
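For reference, a DSN-based SQLAlchemy URL for pyodbc + FreeTDS can be sketched like this (the DSN name and credentials are placeholders, not values from this thread):

```python
# Sketch: building a DSN-based SQLAlchemy URL for pyodbc + FreeTDS.
# "my_dsn" is a hypothetical DSN name defined in /etc/odbc.ini.
def mssql_dsn_url(user, password, dsn):
    return "mssql+pyodbc://{0}:{1}@{2}".format(user, password, dsn)

url = mssql_dsn_url("user", "password", "my_dsn")

# With SQLAlchemy and pyodbc installed, the engine is created lazily
# (no connection is made until the engine is first used):
# from sqlalchemy import create_engine
# engine = create_engine(url)
```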

All this said, I have had significant issues with high I/O migrations FROM SQL Server instances located in AWS and Microsoft Azure. Out of curiosity, where is the SQL Server instance?

EthanBannister commented 8 years ago

strace.txt attached

I also have FreeTDS set up, and it was working as far as I could tell, but I thought that I would give MS's version a go.

I will try things out using FreeTDS and let you know how things turn out shortly.

EthanBannister commented 8 years ago

The MSSQL instance is external. I am running this script on an Ubuntu box (older 14.04).

It looks like the MSSQL driver may be the culprit, so I just need to do some more keyboard grinding, I suppose. It was able to fetch all the rows, but migrating them to the new DB still appears to be an issue:

```
Dropping database 'Temp_For_ETL'
Creating database 'Temp_For_ETL'
Sending source '<etlalchemy.ETLAlchemySource.ETLAlchemySource instance at 0x7fb70770f9e0>' to destination 'mysql://user:password@localhost/Temp_For_ETL'
/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mssql/base.py:1643: SAWarning: Unrecognized server version info '95.12.255'.   Version specific behaviors may not function properly.   If using ODBC with FreeTDS, ensure TDS_VERSION 7.0 through 7.3, not 4.2, is configured in the FreeTDS configuration.
  ".".join(str(x) for x in self.server_version_info))
ETLAlchemySource (INFO) -
        *************************
        *** Total Tables: 1 ***
        *************************

ETLAlchemySource (INFO) - Reading Table Schema 'pull_table'...
ETLAlchemySource (INFO) - Loaded indexes and FKs for table 'pull_table'
ETLAlchemySource (WARNING) - Table 'pull_table' does not exist in the dst database (we will create this later...)
ETLAlchemySource (INFO) - Building query to fetch all rows from pull_table
ETLAlchemySource (INFO) - Done. (164558 total rows)
ETLAlchemySource (INFO) - Loading all rows into memory...
ETLAlchemySource (INFO) - Fetched 10000 rows
ETLAlchemySource (INFO) - Fetched 20000 rows
ETLAlchemySource (INFO) - Fetched 30000 rows
ETLAlchemySource (INFO) - Fetched 40000 rows
ETLAlchemySource (INFO) - Fetched 50000 rows
ETLAlchemySource (INFO) - Fetched 60000 rows
ETLAlchemySource (INFO) - Fetched 70000 rows
ETLAlchemySource (INFO) - Fetched 80000 rows
ETLAlchemySource (INFO) - Fetched 90000 rows
ETLAlchemySource (INFO) - Fetched 100000 rows
ETLAlchemySource (INFO) - Fetched 110000 rows
ETLAlchemySource (INFO) - Fetched 120000 rows
ETLAlchemySource (INFO) - Fetched 130000 rows
ETLAlchemySource (INFO) - Fetched 140000 rows
ETLAlchemySource (INFO) - Fetched 150000 rows
ETLAlchemySource (INFO) - Fetched 160000 rows
Traceback (most recent call last):
  File "etl_alm_asset.py", line 9, in <module>
    mysql_db_target.migrate(migrate_fks=False, migrate_indexes=False, migrate_data=True, migrate_schema=True)
  File "/usr/local/lib/python2.7/dist-packages/etlalchemy/ETLAlchemyTarget.py", line 86, in migrate
    migrate_data=migrate_data)
  File "/usr/local/lib/python2.7/dist-packages/etlalchemy/ETLAlchemySource.py", line 995, in migrate
    raw_rows = [list(row) for row in rows]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 10: invalid start byte
```

I will pick at it a little more to work out the kinks and report back.

Thanks for the help.

EthanBannister commented 8 years ago

Yeah, it does not appear to be a driver issue now, since I am pulling the rows in with no issues. I double-checked my odbc.ini, freetds.conf, and odbcinst.ini files, and they appear to be working as they should. I was also able to connect to the DB using isql with no issues, specifying the DSN when I ran it.
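For anyone following along, a DSN can be smoke-tested from the shell like this (DSN name and credentials are placeholders):

```shell
# Verify the ODBC DSN outside of Python; a working setup prints
# "Connected!" and drops you at a SQL> prompt.
isql -v my_dsn username password
```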

I read in the base.py file that the "SAWarning: Unrecognized server version info '95.12.255'" message at the top is a false alarm and can be ignored. That makes sense, since I am clearly pulling the rows, as shown in the output.

I then tried to figure out what could be happening with that whole 'UnicodeDecodeError' at the end of the trace, since that is apparently what is holding me up here. I am not sure about this part though. I am not sure if the migration doesn't like the MS encoding that it pulled down, or if I have a broken python/linux package somewhere.

Any ideas?

seanharr11 commented 8 years ago

There are known issues surrounding unix+PyODBC, SQL Server and Unicode, and there are some workarounds as well. SQLAlchemy does a good job of briefly explaining this: http://docs.sqlalchemy.org/en/latest/dialects/mssql.html#unicode-binds
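As background for the traceback above: byte 0xa0 is a non-breaking space in Latin-1/Windows-125x single-byte encodings, but it is not a valid UTF-8 start byte, which is exactly the error in the trace. A quick illustration (the sample bytes are made up, not the actual row data):

```python
# 0xa0 is a non-breaking space in Latin-1 / Windows-125x,
# but not a valid UTF-8 start byte.
raw = b"balance\xa0due"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    # Fails at the 0xa0 byte, just like the etlalchemy traceback.
    print("utf-8 decode failed at position", e.start)

# Decoding with a single-byte encoding succeeds:
text = raw.decode("latin-1")
assert text == u"balance\xa0due"
```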

You can try adding the following parameters to your DB connection string, and see if it solves your problem:

```
mssql+pyodbc://user:password@server?encoding=windows-1255&convert_unicode=True
mssql+pyodbc://user:password@server?supports_unicode_binds=True
```

^^ This doesn't work, just tried.

@EthanBannister Try opening up your odbc.ini file (probably found at /etc/odbc.ini) and editing the DSN so that it has a CharSet setting of 'utf8'. The file should look like this:

```
[my_dsn]
Description=MY DSN
Driver=FreeTDS
Database=MyDatabase
Servername=mydatabase.aws.amazon.com
CharSet=utf8
```

seanharr11 commented 8 years ago

If that doesn't work, clone this repo locally, and change the following (line 870 in ETLAlchemySource.py) from

  try:
       self.engine = create_engine(self.database_url)
  except ImportError as e:
       raise DBApiNotFound(self.database_url)

...to...

  try:
       # Note the supports_unicode_binds=True flag
       self.engine = create_engine(self.database_url, supports_unicode_binds=True)
  except ImportError as e:
       raise DBApiNotFound(self.database_url)

Build the package locally with `pip install .` in the project root directory, and then re-run your script and see if this fixes anything.

I can add support to pass this argument (`supports_unicode_binds`) to the ETLAlchemySource object if this fixes the problem.
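A hypothetical sketch of what that could look like. The class and kwarg plumbing below are illustrative only, not the actual etlalchemy API:

```python
# Hypothetical sketch: forwarding an optional engine flag from the
# source object down to create_engine. Names are illustrative.
class Source(object):
    def __init__(self, database_url, supports_unicode_binds=False):
        self.database_url = database_url
        self.supports_unicode_binds = supports_unicode_binds

    def connect(self, create_engine_fn):
        # Only pass the flag when the caller asked for it, so other
        # dialects keep their defaults.
        kwargs = {}
        if self.supports_unicode_binds:
            kwargs["supports_unicode_binds"] = True
        return create_engine_fn(self.database_url, **kwargs)
```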

I've been through hell and back with some of these weird encodings, and all these problems originate in Microsoft-land...hope this works. Let me know.

EthanBannister commented 8 years ago

I cloned it to my dist-packages directory (with a few adjustments to the directory structure of etlalchemy), and made the changes. Sadly, it yields the same results.

Thank you again for your help on this. If you still want me to try some things, please let me know.

I think I am going to try to set up a Docker container for the time-being to see if that helps. Perhaps a python virtualenv? I'll look more into it...

seanharr11 commented 8 years ago

Did you try changing the CharSet setting in the odbc.ini file from my comment above? I think I remember that helping me solve a very similar problem in the past...

EthanBannister commented 8 years ago

I did. I also tried doing things like placing a hyphen in the argument (utf8 vs utf-8). Still no luck.

seanharr11 commented 8 years ago

I am closing this issue, and opening a new issue here.

Hop on that thread with any additional comments, I have a few ideas...

EthanBannister commented 8 years ago

Got it! Everything is right with the world again!

I just had to make some adjustments to the freetds.conf, odbc.ini, and odbcinst.ini files. Here are the entry examples of those files:

freetds.conf:

```
[DSN]
   host = 192.168.2.103
   port = 1433
   tds version = 8.0
```

odbc.ini:

```
[DSN]
Driver = FreeTDS
Port = 1433
Database = MyDatabase
TDS_Version = 8.0
ServerName = DSN
```

odbcinst.ini:

```
[FreeTDS]
Description = FreeTDS Driver v0.91
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
```

It also no longer shows the "SAWarning: Unrecognized server version info '95.12.255'" message in the output, which was nice to see.

Also, I changed a few things for my own purposes, and would like to share them with you in hopes that you may be able to implement it in future releases.

The first thing I changed was line 650 of ETLAlchemySource.py, which determines what each field is enclosed by in the generated CSV. I changed this from a single quote to an optional double quote, since the data being loaded contained quotes that ended up in the generated CSV.

From: `"--fields-enclosed-by=\"'\" "`

To: `"--fields-optionally-enclosed-by=\'\"\' "`

I also commented out line 736, which removes the generated CSV file after all is said and done. This leaves me with a file I can import manually, just in case something happens again and I can no longer use the script:

`os.remove(data_file_path)`

Finally, where can I donate for this project? This script was a help for me and I wouldn't mind throwing a little cash your way for your efforts.

Thanks again for the help!