internetarchive / dweb-archivecontroller

GNU Affero General Public License v3.0

Which files are downloadable could be improved #8

Open mitra42 opened 4 years ago

mitra42 commented 4 years ago

Context: Some files aren't marked downloadable when they could be.

Proposal and Constraints: Change the "downloadable" field to mean:
- false: no, the file is not downloadable
- undefined: yes, using a default download type obtained by upper-casing the format
- string "xxx": yes, replacing the format with "xxx"
- function (for future use): apply the function to the file to get its downloadable string
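A minimal sketch of how the proposed field could be interpreted, written in JavaScript on the assumption that this is the language of dweb-archivecontroller. The function name `downloadType`, the shape of the table rows, and the example rows themselves are invented for illustration; none of them is the existing API or the real format table.

```javascript
// Hypothetical helper: resolve the download type for one file.
// `entry` is a row of a format table (in the spirit of _formatarr);
// `file` is the file's metadata object. Returns undefined when the
// file should not be offered for download.
function downloadType(entry, file) {
  const d = entry.downloadable;
  if (d === false) return undefined;                       // false: not downloadable
  if (d === undefined) return entry.format.toUpperCase();  // default: upper-cased format
  if (typeof d === "string") return d;                     // string: replaces the format
  if (typeof d === "function") return d(file);             // future use: computed per file
  throw new TypeError(`unexpected downloadable value: ${d}`);
}

// Example rows, invented for illustration only.
const rows = [
  { format: "Metadata", downloadable: false },
  { format: "Ogg Vorbis" },
  { format: "Text PDF", downloadable: "PDF" },
  { format: "DjVu", downloadable: (file) => (file.source === "derivative" ? undefined : "DJVU") },
];
```

Leaving the field undefined as the "yes" case means most rows of the table would need no change; only exclusions and renames have to be written out.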

Success Metrics: Not clear; I don't have a good example to test against.

mitra42 commented 4 years ago

The key code is in Details.inc in the petabox tree. The function is info_right, which loops over the files with lots of heuristics that would probably be better in a table like _formatarr.

The key step in implementing the above would be the code that decides whether to "continue", which excludes the file:

        // allow user-uploaded DJVU files to show in [DOWNLOAD OPTIONS] area
        $keep = ($formatUC=='ARCHIVE BITTORRENT'  ||
                 ($formatUC=='DJVU'  &&  !($file['SOURCE']=='derivative')));
        if (!$keep  &&  (preg_match('/(TORRENT|M3U|DJVU|ESSENTIA|METADATA|COLUMBIA)/', $formatUC)  ||
                         in_array($formatUC, [
                           'METADATA','SCANDATA','FLAC FINGERPRINT','CHECKSUMS',
                           'THUMBNAIL','JPEG THUMB','BITMAP IMAGE','BZIP2','MARC',
                           'UNKNOWN','CONTENTS','MARC BINARY','MARC SOURCE','DUBLIN CORE',
                           'JSON','LOG','CUE SHEET','DERIVATION RULES','VIDEO INDEX',
                         ])  ||
                         ($file['SOURCE']=='derivative' &&
                         in_array($formatUC, ['ANIMATED GIF','JPEG','PNG','SPECTROGRAM'])))) {
          continue;
        }

and then

        if ($this->isUnrestricted()) {
          // Skip lendable formats, if any
          if (Util::ends_with($filename, '_encrypted.pdf')
            || Util::ends_with($filename, '_encrypted.epub')
            ) {
            continue;

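Or in words, as far as these two snippets go, with the highest rule taking precedence (note that this covers only the non-lendable, non-print-disabled case):

1. On unrestricted items, exclude lendable files whose names end in "_encrypted.pdf" or "_encrypted.epub".
2. Keep "Archive BitTorrent" files and user-uploaded (non-derivative) DJVU files.
3. Exclude any format containing TORRENT, M3U, DJVU, ESSENTIA, METADATA or COLUMBIA.
4. Exclude the listed formats (Metadata, Scandata, Checksums, thumbnails, MARC variants, JSON, Log, and so on).
5. Exclude derivative files whose format is Animated GIF, JPEG, PNG or Spectrogram.
6. Anything else passes these two checks.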
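The two PHP excerpts above could be restated as data plus one small function, which is roughly what moving the heuristics into a table would look like. This is a sketch in JavaScript under stated assumptions: `isDownloadable`, the `file.format` / `file.source` / `file.name` field names, and the `itemIsUnrestricted` parameter are hypothetical, and only the rules visible in the excerpts are covered.

```javascript
// Sketch only: the quoted PHP rules restated as tables.
// All names here are hypothetical, not the dweb-archivecontroller API.
const EXCLUDE_PATTERN = /(TORRENT|M3U|DJVU|ESSENTIA|METADATA|COLUMBIA)/;
const EXCLUDE_FORMATS = new Set([
  "METADATA", "SCANDATA", "FLAC FINGERPRINT", "CHECKSUMS",
  "THUMBNAIL", "JPEG THUMB", "BITMAP IMAGE", "BZIP2", "MARC",
  "UNKNOWN", "CONTENTS", "MARC BINARY", "MARC SOURCE", "DUBLIN CORE",
  "JSON", "LOG", "CUE SHEET", "DERIVATION RULES", "VIDEO INDEX",
]);
const EXCLUDE_IF_DERIVATIVE = new Set(["ANIMATED GIF", "JPEG", "PNG", "SPECTROGRAM"]);
const LENDABLE_SUFFIXES = ["_encrypted.pdf", "_encrypted.epub"];

function isDownloadable(file, itemIsUnrestricted = true) {
  const formatUC = file.format.toUpperCase();
  const derivative = file.source === "derivative";

  // Explicit keeps override the format-based exclusions below.
  const keep = formatUC === "ARCHIVE BITTORRENT" || (formatUC === "DJVU" && !derivative);
  if (!keep && (EXCLUDE_PATTERN.test(formatUC)
      || EXCLUDE_FORMATS.has(formatUC)
      || (derivative && EXCLUDE_IF_DERIVATIVE.has(formatUC)))) {
    return false;
  }

  // Second excerpt: skip lendable formats on unrestricted items.
  if (itemIsUnrestricted && LENDABLE_SUFFIXES.some((s) => file.name.endsWith(s))) {
    return false;
  }
  return true;
}
```

For example, `isDownloadable({ format: "JPEG", source: "original", name: "cover.jpg" })` passes, while the same format with `source: "derivative"` is excluded.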

mitra42 commented 4 years ago

That code also has some special cases for the download format based on prefixes (see #7), especially for PDF; these don't seem to intersect with the keep/exclude rules.

mitra42 commented 4 years ago

See also https://github.com/internetarchive/dweb-archive/issues/119 re sorting