openSNP / snpr

The sources of the openSNP website
http://opensnp.org
MIT License

Tons of garbage on opensnp #559

Open chaplin89 opened 3 months ago

chaplin89 commented 3 months ago

Hey, not sure if you're aware but there's really a lot of garbage there, as OpenSNP is probably not checking what users are uploading.

Here's a normalized list of file types I've found in your db:

I was curious about the EXEs; at least they don't seem to contain viruses. One of them is from a tool called "MyHeritage Family Builder Genealogy Software" and all the rest are called "23andme to FASTA". It shouldn't be too hard to clean this up and to add some checks that run after people upload something. I did this analysis using the Linux "file" utility, and I think the same could probably be done on the server side. If you go that route, watch out for command injection. A neat improvement would be to store all the files in the same format.

I'm attaching a list of files with their format: file_type.csv

Also, the phenotype section doesn't seem to be monitored very closely: someone created a "naked body phenotype" and used it to share a naked picture of himself. Not sure about the scientific value of that lol

gedankenstuecke commented 3 months ago

Hey @chaplin89, thanks for getting in touch and that list!

In our pre-parsing of uploaded files we already try to unzip files and get rid of the "wrong" ones, meaning everything that doesn't look like a 'correct' genotyping file (see here: https://github.com/openSNP/snpr/blob/0a1d2aa891de2c4e86de69053aa188f870c47767/app/workers/preparsing.rb#L114). For various reasons that doesn't always seem to work out!

I'll have a think about how we can keep a better eye on it!

chaplin89 commented 3 months ago

I'm wondering what happens in that readline() when the input is a binary file. Seems like you're catching the exceptions during the unzip, but what about what happens later?
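
One way to guard line-by-line parsing against binary input is to sniff the first bytes before reading any lines. Here is a minimal Python sketch of that idea, not the project's Ruby code; the function name looks_like_text and the 4096-byte sample size are assumptions made up for illustration:

```python
import codecs


def looks_like_text(path, sample_size=4096):
    """Heuristic: a file counts as text if its first bytes contain no NUL
    byte and are valid UTF-8. An empty file is treated as text."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that is cut off at the end of the sample, but still
    # raises on bytes that can never be valid UTF-8.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

This is only a heuristic: it would reject text in other encodings such as UTF-16, so the rule would need adjusting if such uploads are expected.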

In any case, the way I would do this in Python is probably to launch the "file" utility and look at what it reports. I think there's also a Python library for this; not sure about Ruby.
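
Roughly, launching the "file" utility from Python could look like the sketch below. It assumes the utility is installed on the server. The function names and the allowlist of MIME types are hypothetical and would need tuning against real uploads:

```python
import subprocess

# Hypothetical allowlist; the exact MIME strings depend on the installed
# version of the file utility.
ACCEPTED_MIME_TYPES = {"text/plain", "application/zip"}


def detect_mime_type(path):
    """Return the MIME type that the file utility reports for path."""
    result = subprocess.run(
        # Passing a list means no shell is involved, and "--" stops a
        # filename that starts with "-" from being read as an option.
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def is_acceptable_upload(path):
    """Check the detected MIME type against the hypothetical allowlist."""
    return detect_mime_type(path) in ACCEPTED_MIME_TYPES
```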

As a side note, I see there's a system call in that file where you're grepping the input file for e-mail addresses. I'm not sure what the filename is at that point, but a friendly reminder: if the uploader can control even a single part of that filename and it ends up interpolated into a shell command, they'll also be able to execute code on the server.
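
To illustrate the point in Python terms (the Ruby code may look different): passing the command as an argument list instead of a single shell string means the filename is never parsed by a shell. The pattern and the function name below are assumptions for the sake of the example, not what preparsing.rb actually uses:

```python
import subprocess

# Illustrative pattern only; a rough match for e-mail-like strings.
EMAIL_PATTERN = r"[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}"


def count_lines_with_email(path):
    """Count the lines in path that contain something e-mail-like."""
    result = subprocess.run(
        # Each list element is one argument, so a name such as
        # "x; rm -rf ~" is just an odd filename, not a second command.
        ["grep", "-E", "-c", "-e", EMAIL_PATTERN, "--", path],
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 on no match, and >1 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return int(result.stdout.strip())
```

The unsafe variant would be building one string such as "grep ... " + filename and handing it to a shell, where characters like ";" or "$(...)" in the filename get interpreted.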

EDIT Side note 2: you're unzipping the file, but perhaps a better approach would be to unzip it and then re-compress it with gzip? You'd save tons of space and bandwidth, and on the other side you can read a gzip file without (fully) decompressing it first.
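
A rough Python sketch of that re-compression idea, using only the standard library. The function names are made up for illustration, and it assumes the archive member holding the genotype data is already known:

```python
import gzip
import shutil
import zipfile


def zip_member_to_gzip(zip_path, member_name, gz_path):
    """Stream one member of a zip archive into a gzip file without
    holding the uncompressed data in memory."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as source, gzip.open(gz_path, "wb") as target:
            shutil.copyfileobj(source, target)


def first_lines(gz_path, limit=5):
    """Return the first lines of a gzip file. Iteration stops early, so
    only the beginning of the file is decompressed."""
    lines = []
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= limit:
                break
    return lines
```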