ska-sa / katdal

Data access library for the MeerKAT radio telescope
BSD 3-Clause "New" or "Revised" License

mvftoms.py throws connection reset by peer error #296

Open SpheMakh opened 4 years ago

SpheMakh commented 4 years ago

I'm using katdal 0.15 in an Ubuntu 18.04 Docker container.

scan   4 ( 599 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Added new field 1: 'J1939-6342' 19:39:25.03 -63:42:45.6
Wrote scan data (201912.166424 MiB) in 2285.384612 s (88.349316 MiBps)

scan   5 ( 602 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Traceback (most recent call last):
  File "/usr/local/bin/mvftoms.py", line 816, in <module>
    main()
  File "/usr/local/bin/mvftoms.py", line 566, in main
    scan_vis_data, scan_weight_data, scan_flag_data)
  File "/usr/local/bin/mvftoms.py", line 92, in load
    out=[vis, weights, flags])
  File "/usr/local/lib/python3.6/dist-packages/katdal/lazy_indexer.py", line 594, in get
    da.store(kept, out, lock=False)
  File "/usr/local/lib/python3.6/dist-packages/dask/array/core.py", line 951, in store
    result.compute(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/threaded.py", line 84, in get
    **kwargs
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.6/dist-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore.py", line 243, in get_chunk_or_zeros
    return self.get_chunk(array_name, slices, dtype)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 610, in get_chunk
    headers=headers, stream=True)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 587, in complete_request
    result = process(response)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 173, in _read_chunk
    chunk = read_array(data._fp)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 151, in read_array
    bytes_read = fp.readinto(memoryview(data.view(np.uint8)))
  File "/usr/lib/python3.6/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
SpheMakh commented 4 years ago

This is the ID of the data I'm trying to download: 1584577476

SpheMakh commented 4 years ago

@ludwigschwardt any ideas?

ludwigschwardt commented 4 years ago

I could not recreate the error. Such an error message typically means the server slammed down the phone on its side, indicating a temporary overload that should go away if you try again later.

I'm still a bit sad, since my latest improvements aim to catch these errors and turn them into flagged missing data without crashing the script. I'm keeping this issue open to remind me to catch this error too.
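
(As a stopgap on the user side, one could wrap the per-dump load in a retry loop. This is only a rough sketch of that idea, not the internal fix mentioned above — the helper name and retry settings are made up:)

import time

from katdal.lazy_indexer import DaskLazyIndexer

def get_dump_with_retries(arrays, n, out, retries=5, delay=30):
    """Load dump index n of the given lazy indexers into out, retrying on connection resets."""
    for attempt in range(retries):
        try:
            DaskLazyIndexer.get(arrays, n, out=out)
            return
        except ConnectionResetError:
            print(f'Dump {n}: connection reset (attempt {attempt + 1} of {retries}), retrying...')
            time.sleep(delay)
    raise RuntimeError(f'Dump {n} still failed after {retries} attempts')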

SpheMakh commented 4 years ago

I've tried a couple of times already, and it fails at the same point each time, at around 500 GB. But I'll give it another go.

ludwigschwardt commented 4 years ago

Interesting... I tried the following:

import katdal
from katdal.lazy_indexer import DaskLazyIndexer

d = katdal.open('...')
d.select(scans=5)   # which is where you are getting stuck
# Load the first dump to allocate output arrays, then reuse them as buffers below
v, w, f = DaskLazyIndexer.get([d.vis, d.weights, d.flags], 0)
for n in range(602):
    print(n)
    DaskLazyIndexer.get([d.vis, d.weights, d.flags], n, out=[v, w, f])

It made it all the way to the end... Maybe try this on your setup.

bennahugo commented 4 years ago

Update your katdal, Sphe. If you are coming in from Rhodes you need the latest and greatest! I have run into this network endpoint problem before.


ludwigschwardt commented 4 years ago

katdal 0.15 is pretty new (just pre-lockdown).

SpheMakh commented 4 years ago

This only happens when I'm running in a Docker container. It works fine outside a container. @ludwigschwardt, are there any containers that use mvftoms that you know of? Maybe I made a mistake in building mine.

spassmoor commented 4 years ago

I have not had this issue running on my Docker container that has mvftoms.py. I should also point out that I run it on one of the comm machines, which would mitigate any bad network problems.

SpheMakh commented 4 years ago

@spassmoor I'm running on com08. Can you share the Dockerfile?

SpheMakh commented 4 years ago

Thanks, I'll give it a go.

ludwigschwardt commented 4 years ago

Just remember that this is a public thread, in case your zip contains sensitive info :-)

spassmoor commented 4 years ago

Other than my work email address and my preferred version of tornado, I don't think there is anything sensitive in it.

sarrvesh commented 3 years ago

Hey folks, I encountered a similar issue using katdal 0.17. mvftoms.py failed for me with a connection timeout error on my dataset (1596945366). I tried running @ludwigschwardt's script above and that failed with the same timeout error. Any ideas how to solve this issue?

bennahugo commented 3 years ago

Need more info here.

Is this inside a container? If so, please check that your Docker bridge is working properly by pinging or telnetting to the archive from inside your container.

If it is working, can you indicate whether it fails partway through the download or right at the start? There might be a misconfigured endpoint or something else not related to containerization.
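
(A quick way to do that check from Python inside the container — a rough sketch; the host and ports are the archive gateway mentioned elsewhere in this thread:)

import socket

def can_reach(host='archive-gw-1.kat.ac.za', port=443, timeout=10):
    """Return True if a plain TCP connection to the archive gateway succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach())             # https endpoint
print(can_reach(port=7480))    # internal endpoint, may only be reachable on site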


sarrvesh commented 3 years ago

I get the same error in a container environment and in a normal virtualenv installation. It seems to fail right away with the following error:

StoreUnavailable: Chunk '1596945366-sdp-l0/correlator_data/00168_00000_00000': HTTPConnectionPool(host='archive-gw-1.kat.ac.za', port=7480): Max retries exceeded with url: /1596945366-sdp-l0/correlator_data/00168_00000_00000.npy (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f7263391860>, 'Connection to archive-gw-1.kat.ac.za timed out. (connect timeout=30)'))

My network otherwise works just fine.

ludwigschwardt commented 3 years ago

Hi @sarrvesh, a connection timeout indicates that you could not even start to talk to the archive server, i.e. the phone just rings and rings and nobody picks up. This differs from a connection reset (the topic of this issue), which is the server slamming down the phone in the middle of your conversation.

I see that you are trying to connect to port 7480. Are you on a machine in the CHPC cluster? If not, you'll need to connect to port 443, aka https, and use a token as provided by the RDB link button on the archive. So instead of

d = katdal.open('http://archive-gw-1.kat.ac.za:7480/1596945366/1596945366_sdp_l0.full.rdb')

try

d = katdal.open('https://archive-gw-1.kat.ac.za/1596945366/1596945366_sdp_l0.full.rdb?token=<your-token>')

I managed to download dump 168 with both methods just now, so the server is up and your dataset is intact. That's the good news 😄
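
(As a quick sanity check before rerunning the full conversion, something along these lines should force a single chunk fetch — the token is a placeholder from the RDB link button, and the dump selection is just the one mentioned above:)

import katdal

d = katdal.open('https://archive-gw-1.kat.ac.za/1596945366/1596945366_sdp_l0.full.rdb?token=<your-token>')
d.select(dumps=[168])   # the dump that previously timed out
vis = d.vis[0]          # forces an actual chunk fetch from the archive
print(vis.shape)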

This issue also occurs if you download the RDB file to your local disk and then open it via

d = katdal.open('1596945366_sdp_l0.full.rdb')

That trick only works on the CHPC cluster, or if you also copied all the data to your local disk, since the RDB file only contains the 7480 URL and won't know about the token.

sarrvesh commented 3 years ago

Ah, interesting. Yeah, that works. Thanks very much.

ludwigschwardt commented 3 years ago

Pleasure!

I now remember that there's another option to feed in the token with a local RDB file: treat it like a URL:

d = katdal.open('1596945366_sdp_l0.full.rdb?token=<your-token>')

Although I'm not sure if that will go the https route...