sheltermanager / asm3

Animal Shelter Manager
GNU General Public License v3.0

S3 getting errors again. #927

Closed Connor9220 closed 3 years ago

Connor9220 commented 4 years ago

Started getting strange errors on S3.

Oct 17 15:57:42 ip-172-16-0-10 CRITICAL asm_ dbfs.S3Storage.get An HTTP Client raised and unhandled exception: name must be a byte string

I've researched this, and have no idea what this is about.

bobintetley commented 4 years ago

It looks like a bad interaction between boto3 and whatever it is using to handle HTTP requests (probably urllib3 or the requests module).

Make sure you install boto3 and its dependencies from pip3 instead of from your distribution's package manager.
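
If you want to verify exactly what is on the Python path, here is a quick sketch (assuming Python 3.8+, where importlib.metadata is in the standard library) that prints the versions of the packages most likely involved:

import importlib.metadata

# Print the installed version of each package in the boto3 HTTP stack
for pkg in ("boto3", "botocore", "s3transfer", "requests", "urllib3"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")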

Connor9220 commented 4 years ago

That's strange, because it was working perfectly fine a while ago.

bobintetley commented 4 years ago

Have you done an apt-get upgrade at some point maybe?

We're using Python 3 on a few production servers and not seeing this error.

If you run pip3 list you can get the versions of installed packages. These are the values from one of our Python3 production servers:

# pip3 list
Package          Version      
---------------- -------------
asn1crypto       0.24.0       
awscli           1.18.12      
boto3            1.12.12      
botocore         1.15.12      
certifi          2018.8.24    
chardet          3.0.4        
cheroot          6.5.4        
colorama         0.4.3        
cryptography     2.6.1        
docutils         0.15.2       
entrypoints      0.3          
html5lib         1.0.1        
httplib2         0.11.3       
idna             2.6          
iotop            0.6          
jmespath         0.9.5        
keyring          17.1.1       
keyrings.alt     3.1.1        
olefile          0.46         
Pillow           5.4.1        
pip              18.1         
psycopg2         2.7.7        
pyasn1           0.4.8        
pycrypto         2.6.1        
pycurl           7.43.0.2     
PyGObject        3.30.4       
PyPDF2           1.26.0       
PySimpleSOAP     1.16.2       
python-apt       1.8.4.1      
python-dateutil  2.8.1        
python-debian    0.1.35       
python-debianbts 2.8.2        
python-memcached 1.58         
pyxdg            0.25         
PyYAML           5.2          
reportbug        7.5.3-deb10u1
reportlab        3.5.13       
requests         2.21.0       
rsa              3.4.2        
s3transfer       0.3.3        
SecretStorage    2.3.1        
setuptools       40.8.0       
six              1.12.0       
stripe           2.46.0       
urllib3          1.24.1       
web.py           0.40.dev1    
webencodings     0.5.1        
wheel            0.32.3       
xhtml2pdf        0.2.2  
Connor9220 commented 4 years ago

It's very possible the server has been updated. But to my knowledge, it should still be running everything under Python 2.7.

Package                       Version
----------------------------- ----------------------
backports.functools-lru-cache 1.6.1
boto3                         1.2.2
botocore                      1.16.19
chardet                       2.3.0
Cheetah                       2.4.4
cheroot                       8.3.0
cryptography                  1.2.3
docutils                      0.12
enum34                        1.1.2
flup                          1.0.2
futures                       2.2.0
idna                          2.0
ipaddress                     1.0.16
jaraco.functools              2.0
jmespath                      0.9.0
Markdown                      3.1.1
mod-python                    3.3.1
more-itertools                5.0.0
mysqlclient                   1.3.7
ndg-httpsclient               0.4.0
Pillow                        3.1.2
pip                           20.2.3
psycopg2                      2.6.1
pyasn1                        0.1.9
pychecker                     0.8.19
Pygments                      2.1
pyinotify                     0.9.6
pyOpenSSL                     0.15.1
python-apt                    1.1.0b1+ubuntu0.16.4.9
python-dateutil               2.4.2
python-memcached              1.53
reportlab                     3.3.0
requests                      2.9.1
roman                         2.0.0
setuptools                    44.1.0
six                           1.14.0
urllib3                       1.13.1
web.py                        0.37
wheel                         0.29.0

Connor9220 commented 4 years ago

I had to downgrade botocore. Odd. So, apparently, there are images in the cache that haven't been pushed up to S3. Is there a way to force those to get pushed back up? This has been broken for a while now. I don't know why S3 wouldn't have thrown an error on the upload.

bobintetley commented 4 years ago

Sorry, there isn't. There's no cache index (it would be slow and heavily contended), nor any way of figuring out what a cache file contains, as the filenames are simple hashes of the keys used to access them (typically the ID from the dbfs table and the database name).

The cache is only really there as a very temporary read/write through mechanism to save bandwidth and S3 costs.

I think it will be difficult to rename the files to their dbfs ID and extension and then manually upload them to S3, but that's the only real graceful way to fix it.

You'd need to write a script that reads the URL column from the dbfs table (or extract the URLs to a CSV file or a hard coded list in the script) and works through that list. For each URL, the script should generate an md5 hash of DATABASE:URL (where DATABASE is your database name from sitedefs). An example value for hashing might look like asm:s3:102.jpg, which gives a hash of e99c0ccfb65cfb2885355cfe4956b3be

Once you've generated that hash, you can test whether it's present in /tmp/asm_disk_cache/database - if it is, make a copy of that file and rename it to the URL without the s3: prefix - 102.jpg in our example.

After processing you'll now have a bunch of files that are named DBFSID.extension - you can manually upload these to your S3 bucket to effectively put them back.

They will expire and be removed from your cache on a rolling week period, so you need to move quickly - at the very least, take a backup of /tmp/asm_disk_cache/database now
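
A minimal sketch of those steps (this is just an illustration, not code from ASM itself - the database name asm, the hard coded URL list and the output directory name are assumptions you'd replace with your own values):

import hashlib
import os
import shutil

DATABASE = "asm"                               # your database name from sitedefs (assumed)
CACHE_DIR = "/tmp/asm_disk_cache/" + DATABASE  # cache location described above
OUT_DIR = "recovered"                          # hypothetical output directory

# These would come from: SELECT URL FROM dbfs WHERE URL IS NOT NULL
urls = ["s3:102.jpg"]                          # hard coded here for illustration

os.makedirs(OUT_DIR, exist_ok=True)
for url in urls:
    # The cache filename is the md5 hash of DATABASE:URL, e.g. asm:s3:102.jpg
    h = hashlib.md5(("%s:%s" % (DATABASE, url)).encode("utf-8")).hexdigest()
    src = os.path.join(CACHE_DIR, h)
    if os.path.exists(src):
        # Copy under the URL without the s3: prefix - 102.jpg in the example
        shutil.copyfile(src, os.path.join(OUT_DIR, url[3:]))

Note that, as comes up later in this thread, the cached files are pickled, so the copies still need converting to raw binary before uploading.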

bobintetley commented 4 years ago

If you're wondering why you never saw an error, it's because while the file is put in the cache for immediate usage, a new thread is spawned to upload to S3 in the background. It makes the system a lot more responsive when adding media (it could take 3-4 seconds per upload in some regions), but it means upload is asynchronous with no way of feeding back to the user if an error occurred.
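
As a toy illustration of that fire-and-forget pattern (the function names here are hypothetical, not ASM's actual code), an exception raised on the background thread never reaches the caller:

import threading
import time

def s3_upload(key, data):
    # Stand-in for the real boto3 upload; imagine it failing like the log above
    raise RuntimeError("name must be a byte string")

def put_file(key, data):
    # The caller returns immediately; the upload error surfaces only on
    # the background thread, never to the user who added the media.
    threading.Thread(target=s3_upload, args=(key, data), daemon=True).start()

put_file("102.jpg", b"...")
time.sleep(1)  # the thread's traceback goes to the server log, not the caller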

The botocore error sounds like polyglot code that was designed to run on Python 3 running on Python 2. ASM fully supports and recommends Python 3 and we've been using it in production for over a year now.

Connor9220 commented 4 years ago

So, I ran this query on MySQL.

select concat('mv ', md5(concat('database:',URL)), ' ',trim(LEADING 's3:' from URL)) from dbfs where URL is not NULL

That produced a nice little script I can run inside the cache directory to rename the files. I then used awscli to pull a complete list of all objects in the bucket and removed any of those from the cache copies (so as not to upload dupes), then uploaded the remaining files.
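
To illustrate, assuming 'database:' in the query is replaced with the real database name (asm, as in the earlier hash example), the generated command for the URL s3:102.jpg would be:

mv e99c0ccfb65cfb2885355cfe4956b3be 102.jpg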

bobintetley commented 4 years ago

Well done, that is a very efficient way of doing it!


Connor9220 commented 4 years ago

Yeah, but that just gets me the file names. Now I have to convert the cached files to binaries for the images and HTML files. The HTML files are simple: just strip everything but line 7, convert the \n to true newlines, and remove the leading post characters. The images are a bit more of an issue; I need to convert those to binary.

bobintetley commented 4 years ago

The file format is Python pickle. You should be able to have a Python script open the file and call pickle.load() on the file handle. This will give you a dictionary, the "value" key contains the binary data for the object. You can then save that to a new file (or overwrite the file you just read if you're feeling brave/confident and have made a backup!)

bobintetley commented 4 years ago

There's a function called _lrunpickle(fname) in cachedisk.py that loads the data from a pickled file.

Connor9220 commented 4 years ago

I figured out the Python pickle. Here is a simple program I wrote that converts it:

import sys
import pickle

# Read the pickled cache entry
with open(sys.argv[1], "rb") as f:
    data = pickle.load(f)

# Overwrite the file with the raw binary payload stored under the "value" key
with open(sys.argv[1], "wb") as f:
    f.write(data["value"])
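
To run the same conversion over everything at once, a variation could loop over the whole directory (the directory name recovered is made up for this example; work on a copy, since the files are overwritten in place):

import os
import pickle

SRC = "recovered"  # hypothetical directory of renamed cache copies

for name in os.listdir(SRC):
    path = os.path.join(SRC, name)
    with open(path, "rb") as f:
        data = pickle.load(f)
    # Replace each pickle with the raw binary it wrapped
    with open(path, "wb") as f:
        f.write(data["value"])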