Open georgeflanagin opened 4 years ago
Hi George,
Thank you for your interest in the smart_open library. What follows is my personal opinion. @piskvorky, please let us know what you think about this, too.
I think adding the additional schemes would be a great idea. BTW, AFAIK https reading works now, but writing doesn't.
I'm not so sure about the other functionality: additional file types and file type diagnosis. I think this sort of stuff belongs on a level above what smart_open does. For example, look at our recommended way of dealing with gzip files. I think you could do the same thing with the proposed file types, couldn't you?
Similarly, there are already libraries out there for diagnosing file types. For example, see python-magic. Integrating it with smart_open is trivial:
from smart_open import open
import magic  # the module installed by the python-magic package

with open(..., 'rb') as fin:
    # recommend using at least the first 2048 bytes, as less can produce incorrect identification
    detected_type = magic.from_buffer(fin.read(2048))
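For the few formats mentioned in this issue, a standard-library check of the leading bytes is another option. The following is only a minimal sketch: `sniff_type` and `SIGNATURES` are hypothetical names, the table covers just a handful of well-known signatures, and binary (non-armored) OpenPGP data has no fixed prefix, so it would not be recognized here.

```python
import bz2
import gzip
import lzma

# Leading bytes of a few common formats (illustrative, not exhaustive).
SIGNATURES = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"-----BEGIN PGP MESSAGE-----", "pgp-armored"),
)


def sniff_type(head):
    """Return a format name for the given leading bytes, or None if unrecognized."""
    for magic_bytes, name in SIGNATURES:
        if head.startswith(magic_bytes):
            return name
    return None


# Usage: pass the first bytes read from a file opened in binary mode.
for blob in (gzip.compress(b"payload"), bz2.compress(b"payload"),
             lzma.compress(b"payload"), b"plain text"):
    print(sniff_type(blob[:64]))
```

The result could then be used by the calling application to choose a decompressor or decryption step, which keeps the detection logic outside smart_open itself.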
I should have made it clear, with italics or something, that with https I meant to put the emphasis on the write.
What you said sounds fine to me. When I get back to my office on Monday, I will suggest that we proceed as you described. We use the magic package currently; I see that one of the goals may be to have as few entanglements with other packages as possible/practical. My vision is warped by building products/systems instead of packages/libraries. They are different skills.
In re: the zip files (or gpg files in our case), I see that you are referencing the same package independence. Our crew probably wants to be able to open('abc.gpg', 'rb'), but we will wrap our code externally, and figure out a way to pass in the list of recipients for open('abc.gpg', 'wb').
I have to think a bit more about the schemes. In our ETL software, we exploit the ~/.ssh/config file's ability to define aliases for the host names, and the connection methods. That's at the application level, though. The University's policies prohibit use of passwords for credentials, and as a consequence we tend to see everything through the lens of key-based authentication.
Yes, that's exactly right. The tradeoffs when building a product and a library are different. That said, we can probably find significant common ground. For example, I'm not opposed to using ~/.ssh/config for the ssh submodule. The existing library we use for SSH (paramiko) probably has support for interpreting that config file already.
paramiko definitely has that capability, and we use it now.
Problem description
University of Richmond is a Linux/Python/Oracle shop, with MicroStrategy as the data warehouse. Our staff have written an ETL system entirely in Python 3, and that makes us an obvious consumer of smart_open. The ETL system executes a few more than 1000 integrations per day, and is considered Tier 1 business software. My impressions of the smart_open package are uniformly positive and I plan to incorporate it into our ETL software.
Additional schemes
We would like to take on the task of expanding the existing RFC 3986 schemes to encompass the things we are using now in other, less regular forms:
Additional filetypes
We also deal with a number of other file formats that smart_open could handle by expanding the register_compressor() hook. We are interested in tackling msgpack and pgp/gpg first.
Diagnosing files of unknown type
As well as having a few file extensions of our own, we frequently receive files from our vendors that are of unknown provenance. Files are sometimes bzipped instead of gzipped, and are sometimes encrypted without being named .pgp or .gpg. We would like to add inspection for the well-known file signatures of various file types.
Versions
Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-redhat-7.7-Maipo
Python 3.6.9 (default, Sep 11 2019, 16:40:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
smart_open 1.9
Question
Does the smart_open community think the above sounds like a "good idea," and something of wide interest, or should we fork the code and continue along our own vector? Given the work above and the existing commitments, we are not likely to have a deliverable before 30 June 2020.
Checklist
Before you create the issue, please make sure you have: