python-trio / trio

Trio – a friendly Python library for async concurrency and I/O
https://trio.readthedocs.io
Other
6.06k stars 330 forks source link

Add collection of worked examples to tutorial #472

Open njsmith opened 6 years ago

njsmith commented 6 years ago

We should refactor the tutorial into an initial part that's similar to what we have now, or the middle part of my talk, and then a collection of examples that also serve as excuses to go into more depth on particular topics.

I expect the list will grow over time, but here are some ideas. (Actually the main reason I'm filing this is to have a place collect these so I don't lose them.)

Possibly some of these could be combined or form sequences, eg echo server -> catch all handler -> nursery.start

Fuyukai commented 6 years ago

An asyncio.gather-like (collect the results of all tasks and return the results) would be a good example (as you've said before on Gitter iirc)

njsmith commented 6 years ago

Oh yeah, good idea! (And we should use the terms asyncio.gather and Promise.all in the headline, because people seem to be looking for those.)

njsmith commented 6 years ago

Oh, see also #421, which is a partial duplicate and has some more discussion of the nursery-based examples.

njsmith commented 6 years ago

Oh duh, here's another one: an example of implementing a custom protocol, by combining a sansio protocol with the stream interface. (Probably some simple line-oriented or netstring-oriented thing. This is one of the motivations for why I started working on sansio_toolbelt. I should make it actually usable...)

smurfix commented 6 years ago

A walkthrough for converting a sync protocol to Trio might also make sense.

WRT trio-asyncio: maybe simply refer to it. I do need to add some example that shows how to convert from asyncio to trio-asyncio to trio, and how that improves the code. ;-)

nicoddemus commented 6 years ago

I do need to add some example that shows how to convert from asyncio to trio-asyncio to trio, and how that improves the code. ;-)

Would love to see that. 😁

njsmith commented 6 years ago

Something like @jab's HTTP CONNECT proxy from https://github.com/python-trio/trio/pull/489#issuecomment-379455747 might be interesting too.

(Possibly rewritten to use h11 ;-).)

njsmith commented 6 years ago

@oremanj's as_completed here might be interesting as the basis for a gather equivalent: https://gitter.im/python-trio/general?at=5ad186345d7286b43a29af53

Maybe examples of testing and debugging would be good too. (Testing might just refer to the pytest-trio docs. Which we still need to write...)

oremanj commented 6 years ago

There was some more discussion of this on Gitter today, which resulted in rough drafts of reasonable implementations for gather and as_completed both: https://gitter.im/python-trio/general?at=5ae22ef11130fe3d361e4e25

@N-Coder pointed out that there are a number of useful "asyncio cookbook" type articles floating around, and probably trio would benefit from something that serves a similar role. I think that's the same idea in this thread, but the examples are potentially helpful:

njsmith commented 6 years ago

527 is a common question; we should have something for it.

njsmith commented 6 years ago

As mentioned in #537, a UDP example would be good. This would also be a good way to demonstrate using trio.socket directly.

The example in that comment thread is kind of boring. Doing a dns query or ntp query would be more interesting. [edit: see notes-to-self/ntp-example.py for an ntp query example.]

njsmith commented 6 years ago

It would be good to have an example that discusses the subtleties of aclose: in general, cancellation means "stop what you're doing ASAP and clean up". But if what you're doing is cleaning up... what does that mean? Basically just "clean up ASAP" → graceful vs forceful close.

Maybe this would fit in with an example that wraps a stream in a higher-level protocol object, so we have to write our own aclose? Esp. if the protocol has some kind of cleanup step.

njsmith commented 6 years ago

Interaction between __del__ and trio is another thing we should discuss somewhere. (It's very tricky. The obvious problem is that __del__ isn't an async method, so it can't await anything. But it's actually much worse than that. __del__ methods are called at arbitrary moments, so they have all the same complexity as signal handlers. Basically the only operation that's guaranteed to be usable from __del__ is TrioToken.run_sync_soon. At least this is better than asyncio, where AFAICT literally no methods are guaranteed to be usable from __del__, but it's still extremely tricky.)

njsmith commented 5 years ago

Channel examples – we might move the ones that are currently in reference-core.rst here.

It would also be good to have an example of tee, if only to have something to point to when explaining that it's not what ReceiveChannel.clone does. The simplest version is something like:

async def send_all(value, send_channels):
    async with trio.open_nursery() as nursery:
        for send_channel in send_channels:
            nursery.start_soon(send_channel.send, value)

But then there are complications to consider around cancellation, and error-handling, and back-pressure...

njsmith commented 5 years ago

Using a buffered memory channel to implement a fixed-size database connection pool

njsmith commented 5 years ago

Re previous message: @ziirish wrote a first draft: https://gist.github.com/ziirish/ab022e440a31a35e8847a1f4c1a3af1d

njsmith commented 5 years ago

Zero-downtime upgrade (via socket activation, socket passing, unix-domain socket + atomic rename?)

njsmith commented 5 years ago

example of how to "hide" a nursery inside a context manager, using @asynccontextmanager (cf https://github.com/python-trio/trio/issues/882#issuecomment-457962244)

[Edit: And also, what to do in case you need to support async with some_object: ...]

N-Coder commented 5 years ago

Maybe some information about how to do Tasks in trio, like here: https://github.com/python-trio/trio/issues/892#issuecomment-459195578

[Note: I think this means asyncio.Task-equivalents -njs]

njsmith commented 5 years ago

As requested by @thedrow (e.g. #931), it would be great to have a simple worked example of wrapping a callback/fd-based C library and adapting it to Trio style, demonstrating wait_readable/wait_writable. We'll want to take special care to talk about cancellation too, because that's important and easy for newcomers to forget about.

I'm not sure what the best way to do this would be. Callback-based C libraries tend to be complicated and have idiosyncratic APIs. Which to some extent is useful for an example, because we want to show people how to handle their own complicated and idiosyncratic API, but it can also be problematic, because we don't want to force people to go spend a bunch of time learning about details of some random library they don't care about.

We could write our own toy library just for the example, in C or Rust or whatever.

We could pick an existing library that we think would be good pedagogically. Which one? Ideally: fairly straightforward interface, accomplishes a familiar task, already has a thin Python wrapper or it's trivial to make one through cffi. Some possibilities:

thedrow commented 5 years ago

One of the things Celery will need to do is to write an sd_notify implementation with async support. It should be fairly easy to write in Rust/C and there's only one socket to integrate with.

njsmith commented 5 years ago

@thedrow we already have an issue for sd_notify and friends – let's discuss that over on #252. This thread is about examples to teach people about trio, and AFAIK there isn't anything pedagogically interesting about sd_notify – it's pretty trivial to implement in pure python, or if you have an existing C/Rust implementation you like then it's trivial to bind in python and the binding won't care what io library you're using.

oremanj commented 5 years ago

Some examples of commonly-desired "custom supervisors" would be useful, e.g. the dual-nurseries trick in #569.

wgwz commented 5 years ago

As mentioned here in gitter earlier today, when I was working the Semaphore primitive I felt unsure of how I was using it. I'd appreciate some examples and documentation on the common use cases of the synchronization primitives shipped with trio and will try to help with this. There is existing documentation here that should be considered too.

njsmith commented 4 years ago

Here's a sketch for a web spider, including a tricky solution to figuring out when a circular channel flow is finished: https://gist.github.com/njsmith/432663a79266ece1ec9461df0062098d

snedeljkovic commented 4 years ago

Hey @njsmith I just tested your spider, and it seems there is an issue regarding the closing of the send channel clones. Here is a mwe with a proposed fix:

import trio
import random
from collections import deque

WORKER_COUNT = 10

tasks = deque(i for i in range(103))
results = []

async def worker(worker_id, tasks, results, receive_chan):

  async def process(task):
    await trio.sleep(random.uniform(0, 0.1))
    return task

  async for send_chan, task in receive_chan:
    async with send_chan:
      result = await process(task)
      results.append((result, worker_id))
      if tasks:
        await send_chan.send((send_chan.clone(), tasks.popleft()))
        continue
      print('Worker {} reached an empty queue.'.format(worker_id))
      break

async def batch_job(tasks, results):
  send_chan, receive_chan = trio.open_memory_channel(float("inf"))
  for _ in range(WORKER_COUNT):
    await send_chan.send((send_chan.clone(), tasks.popleft()))
  async with trio.open_nursery() as nursery:
    for worker_id in range(WORKER_COUNT):
      nursery.start_soon(worker, worker_id, tasks, results, receive_chan)

trio.run(batch_job, tasks, results)

If I remove the break in the async for send_chan, task in receive_chan loop the program hangs. Could you explain why exactly this fix works? Is there a more correct way to fix the issue?

smurfix commented 4 years ago

You're not closing the original send_chan. Also you fill your queue first and start the workers afterwards, which in a real program (i.e. with a non-infinite queue) isn't a terribly good idea.

smurfix commented 4 years ago

Anyway, why is your code so complicated?

Simplified:

tasks = deque(range(103))
results = []

async def worker(worker_id, results, receive_chan):

  async def process(task):
    await trio.sleep(random.uniform(0, 0.1))
    return task

  async for task in receive_chan:
    result = await process(task)  
    results.append((result, worker_id))  
  print('Worker {} reached an empty queue.'.format(worker_id))

async def batch_job(tasks, results):
  send_chan, receive_chan = trio.open_memory_channel(WORKER_COUNT)
  async with trio.open_nursery() as nursery:
    async with send_chan:
      for worker_id in range(WORKER_COUNT):
        nursery.start_soon(worker, worker_id, results, receive_chan)
      for t in tasks:
        await send_chan.send(t)
      await send_chan.aclose()

trio.run(batch_job, tasks, results)

Note that you don't need (and don't want) an infinite queue; limiting it to the number of workers is more than sufficient.

snedeljkovic commented 4 years ago

@smurfix Thanks for the explanation. Could you please elaborate on why it is a bad idea to first fill the queue and then start the workers? I'm building a concurrent scraper that should roughly speaking be given a batch of urls and then scrape them concurrently. Because of retries some links may get added back to the queue. Yeah the code is more complicated than it should be, but I figured it's simple enough to get the point.

smurfix commented 4 years ago

OK, yeah, if you're scraping then your original code makes more sense. ;-)

The point is that you should never use an infinite queue. Infinite queues tend to fill memory. Also, the point of a queue is to supply back-pressure to the writers (i.e. your scraper) to slow down because the workers can't keep up. This, incidentally, significantly improves your chances of not getting blocked by the scraped sites.

OK, now you have 103 sites in your initial list, 10 workers, and a 20-or-however-many job queue. If you fill the queue first, the job doing that will stall, and since there's no worker running yet you get a deadlock.

snedeljkovic commented 4 years ago

Thank you for pointing out the issue with the infinite queue. I don't understand how filling a regular deque can stall. The workers either add the response to the result or add the url back into the queue (of course there are various retry limits, timeouts etc...). You mentioned a "20-or-however-many job queue". I'm adding all my urls to the queue first. All other additions are done by the workers themselves in case of retries. Regarding rate limiting I do have a mechanism based on timing, and I do not rely on any sort of queue for that if that's what you meant.

Edit: My use case involves a low number of thousands of links at most, so I can afford to keep the entire queue in memory ie. I don't need a separate queue for workers to feed on and another to fill the first one.

njsmith commented 4 years ago

@smurfix you do need something like an infinite queue for a traditional recursive web scraper, since each consumer task may produce an arbitrary number of new work items. So any finite queue limit could produce a deadlock, when all the consumer tasks are blocked waiting for another consumer task to pull from the queue...

It sounds like I misunderstood @snedeljkovic's problem, and I was assuming a single starting URL, while they actually have the full list of urls up front. So in my gist, I took a kind of shortcut that works for the single URL case, of sending in the original send channel without cloning it, but I didn't point out the tricky bit there, so it wasn't obvious how to correctly generalize to a case with multiple starting urls. That's useful to know for future docs – we need to cover the multiple URLs case and explain the logic behind starting it up correctly.

@snedeljkovic It sounds like you figured out what you need to know? If you still have questions, please let us know – but let's move to a new issue or a thread on trio.discourse.group, so we can focus on your questions properly and keep this thread for tutorial update ideas.

smurfix commented 4 years ago

@njsmith Yeah, I realized that but got confused. (It's been a long day.)

In any case, boiling this down to the essentials. you need

Doing this with a lot of cloned senders is perhaps not the most efficient way, thus I have encapsulated the idea in a (somewhat minimal) class, and packaged this version of @snedeljkovic's code in a gist:

https://gist.github.com/smurfix/c8efac838e6b39bedc744a6ff8ca4405

Might be a good starting point for a tutorial chapter.

snedeljkovic commented 4 years ago

@smurfix Wow, thank you very much for the effort! I'm new to this kind of programming so this is of great help to me. @njsmith Sorry, I won't pollute this thread any more ;)

smurfix commented 4 years ago

It's good to keep a basic tenet in mind: don't code to the tools you have – that leads to bad design. Code to the tools you need, and build them with the tools you have. Recurse if necessary. ;-)

jtrakk commented 4 years ago

"How do I"...

decentral1se commented 4 years ago

Quick note about getting started with writing a protocol and wanting to have a way of having pluggable transports (from the Gitter chat just now). For some reason, I couldn't bend my head around this for a while and this might help others:

I'm reading from https://trio.readthedocs.io/en/stable/reference-io.html where I see "it lets you write generic protocol implementations that can work over arbitrary transports". This gave me the idea that I can focus on writing my protocol (encoding, sending, receiving, decoding logic, etc.) and have my Stream class become an instance of MemoryStream, TCPStream, UTPStream, etc. later on

ahh, I see. So the idea is that you implement your protocol as a function or class that takes a Stream object, and uses its methods to send/receive bytes and your code does whatever higher-level thing it wants to with those bytes

so maybe you'd do like my_proto_over_memory_stream = MyProtocol(memory_stream) or my_proto_over_tcp = MyProtocol(tcp_stream) composition, not inheritance

njsmith commented 4 years ago

Here's an example that could be turned into a good cookbook tutorial: why TCP flow control can cause deadlocks, and a way to avoid deadlocks when pipelining: https://gist.github.com/njsmith/05c61f6e06ca6a23bef732fbf5e832e6

decentral1se commented 4 years ago

https://github.com/python-trio/trio/issues/1141 really helped me a lot and I think it would be great to include an example of thinking about providing a open_foo function when trying to design an API for whatever objects you expose to the end-user.

jtrakk commented 4 years ago

From chat

Q: Anyone has experience has experience with adapting callback-based APIs into Trio? I'm looking at pycurl (http://pycurl.io/docs/latest/curlmultiobject.html). I'm wondering if it's possible.

A: the general pattern is to first set up a a callback that will call trio.hazmat.reschedule, and then call trio.hazmat.wait_task_reschedule to put the task to sleep until the callback comes in or the operation is cancelled

merlinz01 commented 3 months ago

I'd like to see a table mapping as many Go idioms to Trio idioms as possible, for those like me who have worked with Go. For starters (probably not entirely correct as I'm fairly new to Trio):


ch := make(chan X)                      | send, recv = trio.open_memory_channel(0)

ch := make(chan X, 37)                  | send, recv = trio.open_memory_channel(37)

close(ch)                               | send.close(); recv.close()

x = <-ch                                | x = await recv.receive()

ch <- X                                 | await send.send(X)

done := make(chan any)                  | with trio.open_nursery() as n:
go func(done)                           |     n.start_soon(func)
<-done                                  |

select { case <-ch: XXX }               | try: recv.receive_nowait()
                                        | except trio.WouldBlock: pass
                                        | else: XXX

for x := range ch { XXX }               | async with recv:
                                        |     async for x in recv: XXX

sync.Mutex                              | trio.Lock

context.Context.Get(XXX)                | XXX = contextvars.ContextVar()
TeamSpen210 commented 3 months ago

@merlinz01 It might make sense indeed to have a table with a few different languages for comparison. I'm not too familiar with Go, but on the Trio side:

merlinz01 commented 3 months ago

Thanks for the corrections! It's only a couple days since I learned about Trio, so I don't have all the Trio idioms mastered.

For the go func() example, this is more what I was thinking of:

num_workers = 5
ch := make(chan MyFunctionResult, num_workers)
// Start the workers
for x := range num_workers {
    go send_the_result_over_chan(ch)
}
// Wait for the results
for x := range num_workers {
    res := <-ch
    if res.err != nil {
        [handle error]
    }
}
close(ch)

vs.

num_workers = 5
try:
    # Start the workers
    with trio.open_nursery() as nursery:
        for x in range(num_workers):
            nursery.start_soon(return_the_result)
    # implicitly wait for the results
except* Exception:
    [handle error]

For the select example, I think this is more accurate:

select {
    case <-ch:
        XXX
    default:
        YYY
}

vs.

try:
    recv.receive_nowait()
except Trio.WouldBlock:
    YYY
else:
    XXX

Go is super for networking and concurrency, but as it is a compiled language with monolithic binaries it is not quite as flexible. For my project I needed Python's extensibility and dynamic code compilation, so I'm using Trio's async functionality.

I suppose there would be other comparisons besides this that would also be helpful to programmers coming from other languages.