Shorten multiple URLs at once

buhman commented 7 years ago

What does the API for this look like? Why is it useful?

kyrias commented 7 years ago

For the former I was thinking either supporting something like how ix.io supports multiple files, or having the URLs encoded in some way that doesn't use a specific delimiter and then splitting on that.

Mostly I want support for it for my script that renders HTML emails to plain-text and then shortens URLs longer than what is reasonable for CLI use, because doing them one by one can get be quite slow for larger emails. On the other hand I could potentially rewrite it and have the shortener be run async.

buhman commented 7 years ago

something like how ix.io supports multiple files

Absolutely; this makes sense. Because the current API response isn't a collection/list, it seems inconsistent/bad to conditionally return a collection (robust callers would then need to implement the inverse logic). Could this be a different endpoint instead? Something like:

https://ptpb.pw/pastes/
https://ptpb.pw/pastes/?short-url=non-null

I've also been tossing around the idea of versioning the API, and/or making a completely separate DebtFree™ service.

Usage would be something like:

curl -F content=@<(<<< https://foo.bar) -F content=@<(<<< https://spam.eggs) https://ptpb.pw/pastes/?short-url=true

The response might look like:

[
  {
    # paste attributes
  },
  {
    # paste attributes
  }
]

Does that look like what you expected?

I suppose one potential problem might be matching up the response items with the request--ix.io for example is not only inconsistent about re-ordering the response, but also has a very buggy multipart parser: https://ptpb.pw/bQrV (see the trailing truncated multipart boundaries included in the paste responses, and inconsistent re-ordering with test32).

potentially rewrite it and have the shortener

Is this code on github? I wouldn't be surprised if multiple parallel requests ended up being faster anyway, and helping you with this is definitely going to be easier than trying to make pb not suck.

kyrias commented 7 years ago

Hm, that might make sense. Or something like

curl -F content:1=@<(<<< https://foo.bar) -F content:2=@<(<<< https://spam.eggs) https://ptpb.pw/pastes/?short-url=true

and then getting something like

{
    1: {
        # details
    },
    ...
}

Using the id's passed to -F as the keys. This could also potentially be an arbitrary string I guess, hm. Might make sense to just implement it in a clean-slate implementation instead though, yeah.

potentially rewrite it and have the shortener

Is this code on github? I wouldn't be surprised if multiple parallel requests ended up being faster anyway, and helping you with this is definitely going to be easier than trying to make pb not suck.

It's essentially a slightly modified version of the tiny.pl script available here, but I'm planning on rewriting it from scratch anyway, just been to lazy to do so as of yet.

buhman commented 7 years ago

($line =~ /(\w+:\/\/\S+)/)

https://regex101.com/r/DwZzBf/1 Is that the behavior you actually want? I feel like limiting replacement to just http and https would make more sense.

So it looks like the expected behavior is this:

provided a filename and mimetype
if the mimetype is text/html
- use elinks to render the html as text
find everything in the text that looks like a (http/https) URL
- replace it with a shorturl
write the result on stdout

Doing multiple concurrent replacements means we can't use ~the read 1 line → emit 1 line memory optimization from tiny.pl.~ https://github.com/ptpb/pb/issues/195#issuecomment-306669397

I think something slightly more memory-optimized than simplest way to handle the inner loop with concurrency might look like:

python>=3.6

out_queue = collections.deque()

async def pump_queue(wait=False):
    while True:
        try:
            line = out_queue.popleft()
        except IndexError:
            break
        if isinstance(line, str):
            yield line
        elif line.done():
            yield line.result()
        elif wait:
            yield await line
        else:
            out_queue.appendleft(line)
            break

for line in file:
    if the_line_has_spooky_urls(line):
        line = await function_that_fixes_the_line_in_future(line)
    out_queue.extend(line)
    # write lines that are ready to go now; while maintaining line ordering
    async for line in pump_queue():
        sys.stdout.write(line)

# write remaining lines; waiting for each line to become available
async for line in pump_queue(wait=True):
    sys.stdout.write(line)

In the best case (pb is infinitely fast, and the email is infinitely long--or, the email has no URLs), memory utilization should be™ constant. In the probable-case, requests will overlap, and memory utilization will fluctuate depending on the size of the text between the queued URLs. In the worst case, memory utilization will be: constant + len(body) + (8 * len(body.splitlines()) + (constant * number_of_urls_in_body)

buhman commented 7 years ago

# Note: using while (instead of for) b/c for supposedly loads # everything into memory - no reason to load large emails into memory

That's pretty fucking hilarious, because, according to perldoc.

The collected standard output of the command is returned; standard error is unaffected. … Like system, backticks put the child process exit code in $?.

So the child process must exit before the result of that expression can be evaluated, and you can't read from broken pipes, so text isn't the magical list-emulating-pipe-proxy the tiny.pl author thinks it is.

buhman commented 7 years ago

I benchmarked a hacked up version of this; the currently-unit-tested parts of the code are on github; this currently just includes a regex stream-parser context thing, and the future queue logic from above.

To compare performance, I trivially modified tiny.pl to replace URLs in the same way microscopic does: gist, then ran them both over a random spam email from my old maildir.

The raw benchmark data: https://gist.github.com/buhman/009179b09d996b55fc1565bd9bac2710; below I discard the outlier 5-second result from tiny.pl.

Results (lower is better):

	average	maximum	x/tiny.pl	x/micro	x/single
single shorturl	0.311	0.335	0.13	0.71	1.0
tiny.pl	2.603	2.676	1.0	5.7	8.0
microscopic	0.462	0.472	0.18	1.0	1.4

Predictions/trends:

While tiny.pl execution time is very obviously ~linear to the number of URLs in the input, if I did a super fancy multiple-line graph (do it?) for time/number-of-urls, microscopic's performance, for a sufficiently large number of URLs, would also approach linear, because, contrast to microscopic, pb's concurrency capabilities are non-infinite. This is probably fine for real emails that don't contain >thousands of URLs though.

Additional optimizations not represented here:

the test data contains multiple identical URLs; microscopic could gain unfair advantage by caching previous url translations in memory

kyrias commented 7 years ago

I should have expected you to go slightly over-board with this, shouldn't I, heh.

So it looks like the expected behavior is this:

provided a filename and mimetype

if the mimetype is text/html

use elinks to render the html as text

find everything in the text that looks like a (http/https) URL

replace it with a shorturl

write the result on stdout

Pretty much. I've just used the current regex as-is because it's been "good enough" in practice, and been too lazy to actually replace it as long as it mostly works.

buhman commented 7 years ago

over-board with this

I would describe this more as "inspiration" instead.

I present a pre-release version of microscopic, a generic pipeline DSL and execution engine.

Usage:

pip install microscopic==0.1.1.dev3
curl -L https://ptpb.pw/gSpk > ./example-pipeline.yaml

# copy arbitrary url-tastic HTML to ./message.html
microscopic ./example-pipeline.yaml

Currently, the only supported pipeline is the pattern_queue pipeline, which has a (simplified) data flow like this:

read_transport → pattern → <no match> → <no op> → write_transport
                         ↳ <match>  → processor ⬏

As suggested in the name, the pattern_queue pipeline executes match processors concurrently while having deterministic output ordering via an internal queue.

This pipeline is assembled using arbitrary combinations of components registered via pkg_resources; hopefully it should be obvious how the mapping is supposed to work by reading the example.

Let me know what you think.

kyrias commented 7 years ago

That does look rather nice indeed.

ptpb / pb

Shorten multiple URLs at once #195