mrkkrp / zip

Efficient library for manipulating zip archives

Blind Archive support. #20

Open ghost opened 8 years ago

ghost commented 8 years ago

I wanted to be able to create an archive against a Handle managed independently of the zip module. I believe what I have is currently working. If you want, I can create a pull request. If you want to see the code, it is in my repo.

{-# LANGUAGE BangPatterns      #-}
{-# LANGUAGE OverloadedStrings #-}

import Codec.Archive.Zip
import Path (parseRelFile)
import System.Directory (removeFile)
import System.IO

main :: IO ()
main = do
  let rubbishFileName = "rubbishfile"
  h <- openFile rubbishFileName ReadWriteMode
  removeFile rubbishFileName     -- unlink: no directory entry remains, but the open handle keeps the data alive
  hSetBinaryMode h True

  !leftovers <- createBlindArchive h $ do
    setArchiveComment "This archive is just a test"
    parseRelFile "./lmn/foo" >>= mkEntrySelector >>= addEntry Store "this is the file content"

  hSeek h AbsoluteSeek 0         -- rewind so the finished archive can be read back from the start

  arch <- openFile "archive.zip" ReadWriteMode
  hSetBinaryMode arch True

  hGetContents h >>= hPutStr arch   -- copy the blind archive into a real file (could just as well be a socket)

  hClose arch
  print leftovers

I can safely write data to the archive without actually exposing it in the filesystem unless I want to. The hPutStr could just as well write to a socket or to a conduit feeding an httpd service, etc.

ghost commented 8 years ago

For more possibilities, including passing handles to and from other processes that may have privileged access to archives the current process lacks, see: http://blog.varunajayasiri.com/passing-file-descriptors-between-processes-using-sendmsg-and-recvmsg

ghost commented 8 years ago

Todo: add blindCopyEntry so that an open Handle to another archive can be queried for its entries.

mrkkrp commented 8 years ago

And where are the archive contents till you write them into a file? In memory?

I'm not sure what is going on in your example, but the approach seems hackish.

ghost commented 8 years ago

No hack at all. The contents are in the filesystem. As long as the handle is held by at least one thread, the data remains in the filesystem. No memory is involved. It is no different from any other file opened anywhere else, except that, due to the unlink (remove), there exists no directory reference to the file.

As soon as the Handle is closed, or the thread/process exits, the file contents are freed by the filesystem. No cleanup necessary.

This leaves one free to create an archive on the fly in a blind/anonymous file. The file can be read or written to by any process/thread that has access to the Handle, which includes passing the Handle to other processes on the OS via sockets.
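For illustration, here is a minimal sketch of the open-then-unlink idiom described above. The helper name withAnonymousHandle is made up for this example; it is not part of the zip package.

import Control.Exception (bracket)
import System.Directory (removeFile)
import System.IO

-- Open a scratch file, unlink it immediately, and hand the resulting
-- anonymous Handle to an action. The data lives only as long as the Handle.
withAnonymousHandle :: FilePath -> (Handle -> IO a) -> IO a
withAnonymousHandle scratchPath act =
  bracket open hClose act
  where
    open = do
      h <- openFile scratchPath ReadWriteMode
      removeFile scratchPath   -- unlink: no directory entry remains, the handle keeps the data alive
      hSetBinaryMode h True
      pure h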

There is nothing new or 'hackish' about this idiom. It has been around for decades.

There are other applications of Handle passing via OS sockets that need not involve unlinking the file from the directory structure. For example, a server can pass restricted archives to an unprivileged process by making the Handle available via an OS socket, with no copy of the data required.

ghost commented 8 years ago

http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/

mrkkrp commented 8 years ago

Thank you, I'll look into that.

ghost commented 8 years ago

This is a related and useful technique: http://stackoverflow.com/questions/14514997/reopen-a-file-descriptor-with-another-access/14515466#14515466
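A minimal sketch of that trick follows. It is Linux-specific (it assumes /proc is mounted), and the helper name reopenReadOnly is made up for illustration: given the numeric descriptor of an already open file, reopening /proc/self/fd/<n> yields an independent Handle onto the same file with a different access mode.

import System.IO
import System.Posix.Types (Fd (..))

-- Reopen an existing descriptor read-only via /proc, giving a fresh Handle
-- with its own file position and access mode.
reopenReadOnly :: Fd -> IO Handle
reopenReadOnly (Fd n) = do
  h <- openFile ("/proc/self/fd/" ++ show n) ReadMode
  hSetBinaryMode h True
  pure h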

ghost commented 8 years ago

Haskell has had support for Handle/fd passing via sockets for many years.

https://hackage.haskell.org/package/network-2.6.3.1/docs/Network-Socket.html#g:10
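A minimal sketch of what that looks like with the network package's sendFd/recvFd. The helper names sendHandle/recvHandle are just for illustration, and the sketch assumes an already connected Unix-domain socket.

import Network.Socket (Socket, recvFd, sendFd)
import System.IO (Handle)
import System.Posix.IO (fdToHandle, handleToFd)
import System.Posix.Types (Fd (..))

-- Sender side: extract the raw descriptor from the Handle and push it over
-- the socket. Note that handleToFd closes the Handle (but not the fd).
sendHandle :: Socket -> Handle -> IO ()
sendHandle sock h = do
  Fd fd <- handleToFd h
  sendFd sock fd

-- Receiver side: pick up the descriptor and wrap it back into a Handle.
recvHandle :: Socket -> IO Handle
recvHandle sock = do
  fd <- recvFd sock
  fdToHandle (Fd fd)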

ghost commented 8 years ago

What follows is a working piece of code that uses createBlindArchive to create an archive from database documents and then serves the archive via Yesod. Once hClose runs, the archive file vanishes from the filesystem. Had exceptions prevented hClose from being reached, the archive file and any contents would vanish as soon as the thread died.

{-# LANGUAGE OverloadedStrings #-}

import qualified Blaze.ByteString.Builder as BB
import Codec.Archive.Zip
import Control.Monad (forM_)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import Data.Conduit (Flush (..), Source, await, yield, (=$=))
import Data.Conduit.Binary (sourceHandle)
import Data.Time (UTCTime)
import Path (parseRelFile)
import System.Directory (removeFile)
import System.IO
import Yesod.Core

data Document = Document { documentName :: FilePath
                         , cronos :: UTCTime
                         }

-- 'Handler' is the application's usual synonym for 'HandlerT App IO'.
download :: FilePath -> [(Document, ByteString)] -> Handler TypedContent
download archivePath documents = do
  h <- liftIO $ do
    h <- openFile archivePath ReadWriteMode
    removeFile archivePath            -- unlink: the archive never appears in the directory tree
    hSetBinaryMode h True

    createBlindArchive h $ do
      setArchiveComment "This archive was created by Me!"
      forM_ documents
              (\(doc, payload) -> do
                 es <- mkEntrySelector =<< parseRelFile (documentName doc)
                 setModTime (cronos doc) es
                 addEntry Store payload es
              )
    hSeek h AbsoluteSeek 0            -- rewind before streaming the finished archive
    pure h

  respondSource "application/zip" $ handleToBuild h

-- Stream the blind archive to the client, closing the handle (and thereby
-- freeing the unlinked file) once the source is exhausted.
handleToBuild :: Handle -> Source (HandlerT site IO) (Flush BB.Builder)
handleToBuild h = sourceHandle h =$= lumps
  where
    lumps = maybeM (liftIO $ hClose h) (\b -> yield (Chunk $ BB.insertByteString b) *> lumps) =<< await

maybeM :: (Applicative m) => m b -> (a -> m b) -> Maybe a -> m b
maybeM _             action (Just a) = action a
maybeM defaultAction _       Nothing = defaultAction

mrkkrp commented 8 years ago

OK, you can go ahead with PR, but please preserve backward-compatibility in API.

ghost commented 8 years ago

Absolutely! I already have the code and it passes all of the prior tests.

ghost commented 8 years ago

Would you like me to delay the PR until I add a set of tests to the test suite, or just get the working code to you first?

mrkkrp commented 8 years ago

@robertLeeGDM, let's first see what you've got.

ghost commented 6 years ago

I thought this approach was about equal to the direct conduit approach of zip-stream, but I am realizing that this blind handle might solve the problem of simply computing the content length for populating an HTTP header before streaming the zip. (sz <- liftIO (IO.hSeek h IO.SeekFromEnd 0 >> IO.hTell h), before seeking back to 0.)
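For reference, a minimal sketch of that size computation, assuming the handle already holds the fully written blind archive; the helper name archiveSize is made up here.

import qualified System.IO as IO

-- Measure the finished archive by seeking to the end (e.g. for a
-- Content-Length header), then rewind so the whole thing can be streamed.
archiveSize :: IO.Handle -> IO Integer
archiveSize h = do
  IO.hSeek h IO.SeekFromEnd 0
  sz <- IO.hTell h
  IO.hSeek h IO.AbsoluteSeek 0
  pure sz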

ghost commented 6 years ago

I have the code for blind handles, and I have used it in commercial production for some time without a problem. I had submitted it as a pull request, but the code was not formatted in accordance with the standards used in this package. I did say I was going to fix it, but I'm a bit clueless with GitHub, so if I did reformat it I'd probably bungle the pull request.

ghost commented 6 years ago

See the 'blind' branch: https://github.com/robertLeeGDM/zip

ghost commented 6 years ago

Memory usage is great in tests. I would emphasize to future users that the filesystem where the handle is created needs enough space for the largest zip file they expect to produce.

In the long run, an approach that doesn't use the filesystem at all, even blindly, is probably a better fit for serving streaming zips from a web application. The drawback of the blind-handle approach is that users have to wait a long time before the download actually starts for larger zip files.

UPDATE:

Update 2: After a few months in production, one of our users' Chrome browsers gives up when the initial response takes too long. I started implementing an async + browser-poll approach. My ideal would be to speed up zip generation and keep everything synchronous, but I am not sure whether I am constrained by the speed of writing buffers to disk. I haven't explored chunked transfer encoding yet.

ghost commented 6 years ago

We are stuck with the fact that zip was not created with streaming in mind. Zip is its own worst enemy when it comes to that.