piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Expanded schemes and filetype registrations #403

Open georgeflanagin opened 4 years ago

georgeflanagin commented 4 years ago

Problem description

University of Richmond is a Linux/Python/Oracle shop, with MicroStrategy as the data warehouse. Our staff have written an ETL system entirely in Python 3, and that makes us an obvious consumer of smart_open. The ETL system runs slightly more than 1,000 integrations per day, and is considered Tier 1 business software. My impressions of the smart_open package are uniformly positive and I plan to incorporate it into our ETL software.

Additional schemes

We would like to take on the task of expanding the existing RFC 3986 schemes to encompass the things we are using now in other, less regular forms:

Additional filetypes

We also deal with a number of other file formats that smart_open could handle by expanding the register_compressor() hook. We are interested in tackling msgpack and pgp/gpg first.

Diagnosing files of unknown type

As well as having a few file extensions of our own, we frequently receive files of unknown provenance from our vendors. Files are sometimes bzipped instead of gzipped, and are sometimes encrypted without being named .pgp or .gpg. We would like to add inspection for the well-known file signatures of various file types.
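To make the idea of signature inspection concrete, here is a minimal sketch using only the standard library. The function name `sniff_signature` and the `SIGNATURES` table are hypothetical and not part of smart_open. It checks the leading bytes of a buffer against the gzip and bzip2 magic numbers and the ASCII-armor header of a PGP message. Binary (non-armored) PGP data has no single fixed prefix, so this sketch does not attempt to detect it.

```python
import bz2
import gzip

# Hypothetical lookup table: (leading bytes, label). Not a smart_open API.
SIGNATURES = (
    (b"\x1f\x8b", "gzip"),                            # gzip magic number
    (b"BZh", "bzip2"),                                # bzip2 stream header
    (b"-----BEGIN PGP MESSAGE-----", "pgp-armored"),  # ASCII-armored PGP only
)


def sniff_signature(head: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for magic_bytes, label in SIGNATURES:
        if head.startswith(magic_bytes):
            return label
    return "unknown"


# Usage: pass the first bytes of a file, whatever its extension claims.
gz_label = sniff_signature(gzip.compress(b"payload"))
bz_label = sniff_signature(bz2.compress(b"payload"))
other_label = sniff_signature(b"plain text")
```

A prefix table like this only covers formats with a fixed header. Anything more ambiguous would still need a fuller detection library.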

Versions

Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-redhat-7.7-Maipo
Python 3.6.9 (default, Sep 11 2019, 16:40:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
smart_open 1.9

Question

Does the smart_open community think the above sounds like a "good idea," and something of wide interest, or should we fork the code and continue along our own vector? Given the work above and the existing commitments, we are not likely to have a deliverable before 30 June 2020.

George Flanagin
Computer Scientist
Puryear Hall, 118 UR Drive
Office G28
University of Richmond, VA 23173
+1.804.287.6392 (+1.80I.BURN.EXAms)

mpenkov commented 4 years ago

Hi George,

Thank you for your interest in the smart_open library. The below is my personal opinion. @piskvorky , please let us know what you think about this, too.

I think adding the additional schemes would be a great idea. BTW, AFAIK https reading works now, but writing doesn't.

I'm not so sure about the other functionality: additional file types and file type diagnosis. I think this sort of stuff belongs on a level above what smart_open does. For example, look at our recommended way of dealing with gzip files. I think you could do the same thing with the proposed file types, couldn't you?
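One way to read "a level above" is that the application opens a raw byte stream and layers its own decoder on top of it. The sketch below illustrates that pattern with the standard library only. The helper `wrap_decoder` is hypothetical, and an `io.BytesIO` object stands in for the binary stream that an open call would return. It is not necessarily the exact approach described in the smart_open documentation.

```python
import bz2
import gzip
import io


def wrap_decoder(raw, kind):
    """Layer a decompressor over an already-open binary stream.

    `kind` is a label chosen by the application (hypothetical), e.g. from
    the file extension or from inspecting the file's leading bytes.
    """
    if kind == "gzip":
        return gzip.GzipFile(fileobj=raw, mode="rb")
    if kind == "bzip2":
        return bz2.BZ2File(raw, mode="rb")
    return raw  # unknown kind: hand back the raw stream untouched


# Stand-in for a remote stream: bzip2 data that might carry a misleading name.
payload = b"hello from the vendor\n"
raw = io.BytesIO(bz2.compress(payload))

with wrap_decoder(raw, "bzip2") as fin:
    data = fin.read()
```

The same shape would apply to other formats: the transport layer only hands over bytes, and the application decides which decoder to wrap around them.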

Similarly, there are already libraries out there for diagnosing file types. For example, see python-magic. Integrating it with smart_open is trivial:

from smart_open import open
import magic  # provided by the python-magic package

with open(..., 'rb') as fin:
    # recommend using at least the first 2048 bytes, as less can produce incorrect identification
    detected_type = magic.from_buffer(fin.read(2048))

georgeflanagin commented 4 years ago

I should have made it clear, with italics or something, that with https I meant to put the emphasis on the write.

What you said sounds fine to me. When I get back to my office on Monday, I will suggest that we proceed as you described. We use the magic package currently; I see that one of the goals may be to have as few entanglements with other packages as possible/practical. My vision is warped by building products/systems instead of packages/libraries. They are different skills.

In re: the zip files (or gpg files in our case), I see that you are referencing the same package independence. Our crew probably wants to be able to open('abc.gpg', 'rb'), but we will wrap our code externally, and figure out a way to pass in the list of recipients for open('abc.gpg', 'wb').

I have to think a bit more about the schemes. In our ETL software, we exploit the ~/.ssh/config file's ability to define aliases for the host names, and the connection methods. That's at the application level, though. The University's policies prohibit use of passwords for credentials, and as a consequence we tend to see everything through the lens of key-based authentication.

mpenkov commented 4 years ago

Yes, that's exactly right. The tradeoffs when building a product and a library are different. That said, we can probably find significant common ground. For example, I'm not opposed to using ~/.ssh/config for the ssh submodule. The existing library we use for SSH (paramiko) probably has support for interpreting that config file already.

georgeflanagin commented 4 years ago

paramiko definitely has that capability, and we use it now.