noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

Spark runs leave nsfs object "busy" #6705

Closed motorman-ibm closed 1 month ago

motorman-ibm commented 3 years ago

Environment info

Actual behavior

  1. Spark query spark.sql("select count(*) from web_returns where wr_return_ship_cost > 2.0")

Expected behavior

  1. File is not busy after spark finishes

Steps to reproduce

  1. See More information - maybe aborted read?

More information - Screenshots / Logs / Other output

I am running Spark, which reads a > 128MB object as a series of ranged reads. I have narrowed my test to one file (instead of all 25) and a dead simple query (select count(*) from web_returns where ...), the where clause being there to force a bypass of any indexes so it has to read the whole file. After the run is complete, lsof shows a bunch of entries like the one below for the object; I also ran the lsof check 10 minutes later and got the same values.

COMMAND       PID     TID TASKCMD             USER   FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
node      3253532                             root   35r      REG               0,48 1053529104    1056781 /nsfs/fs10/gkafmtest/spark-s100/tpcds/web_returns.dat

The full summary shows this

cl1 - EP wr2fq - read_object 14 abort 14 lsof 154 (154/14 = 11)
cl2 - EP 5vllk - read_object 11 abort  9 lsof  99 ( 99/ 9 = 11)

The first line means node cl1's endpoint ending in wr2fq had 14 read_objects issued, 14 aborts issued, and 154 entries for the corresponding file in lsof; the second line is the same for the other endpoint.

What is really interesting is that for both endpoints, lsof entries = 11 x aborts issued (exactly), so I assume there is some sort of correlation there.

For completeness, that summary data was collected like this (the noobaa logs were redirected to noobaa-endpointname.log):

[root@clgk-noob-scl-1 ~]# grep -c read_object noobaa*.log
noobaa-endpoint-9cc895c5f-5vllk.log:11
noobaa-endpoint-9cc895c5f-wr2fq.log:14
noobaa-endpoint-lsof-2.log:0
noobaa-endpoint-lsof.log:0 
[root@clgk-noob-scl-1 ~]# grep -c abort noobaa*.log
noobaa-endpoint-9cc895c5f-5vllk.log:9
noobaa-endpoint-9cc895c5f-wr2fq.log:14
noobaa-endpoint-lsof-2.log:0
noobaa-endpoint-lsof.log:0
for i in clgk-noob-scl-1 clgk-noob-scl-2 clgk-noob-scl-3 clgk-afm-wk ; do echo $i; ssh $i lsof 2>&1 | grep -E "web_returns" | grep -v "no pwd" | wc -l; echo; done

I rarely see read_object_stream completions. I think it is a race condition between the "aborted" handler and the "read_stream = await" assignment: at the time of the abort, read_stream is still undefined, so read_stream.close() is not called.

    req.on('aborted', () => {
        dbg.log0('request aborted:', req.path);
        if (read_stream && read_stream.close) read_stream.close();
    });
    res.on('error', err => {
        dbg.log0('response error:', err, req.path);
        if (read_stream && read_stream.close) read_stream.close();
    });

    read_stream = await req.object_sdk.read_object_stream(params, res);
    if (read_stream) {
        read_stream.on('error', err => {
            dbg.log0('read stream error:', err, req.path);
            res.destroy(err);
        });
        read_stream.pipe(res);
    }
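
One way this race could be closed (a sketch only, not a tested or actual fix): remember that the request was aborted while the await was still pending, and close the stream as soon as it resolves, e.g.:

    // Sketch only: track aborts that fire before the await resolves,
    // so the stream (and its file descriptor) can still be closed afterwards.
    let aborted = false;
    req.on('aborted', () => {
        dbg.log0('request aborted:', req.path);
        aborted = true;
        if (read_stream && read_stream.close) read_stream.close();
    });

    read_stream = await req.object_sdk.read_object_stream(params, res);
    if (read_stream) {
        if (aborted) {
            // The client hung up while we were awaiting - release the fd now.
            if (read_stream.close) read_stream.close();
            return;
        }
        read_stream.on('error', err => {
            dbg.log0('read stream error:', err, req.path);
            res.destroy(err);
        });
        read_stream.pipe(res);
    }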

Spark apparently issues requests in a rather rude way - one that triggers this issue (next post)

motorman-ibm commented 3 years ago

Spark will break the processing of a large enough object into manageable chunks and issue ranged reads as shown below

Aug-11 1:32:22.448 [Endpoint/7]    [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream { file_path: '/nsfs/fs10/gkafmtest/spark-s100/tpcds/web_returns.dat', start: 881436160, end: 1053529104 }

Spark in this case is probably interested in reading from 881436160 to 925507968 bytes. Once that is satisfied, I’m guessing the worker “hangs up” (closes the connection?)

Here is a plot of the read_object_stream requests from Spark. Note they all end at 1053529104 bytes (EOF)

To be fair, it might not be Spark itself; it might be Spark + the HDFS driver (which Spark uses to access s3).

[image: plot of the read_object_stream request ranges]
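
For anyone who wants to reproduce the access pattern without a Spark cluster, here is a minimal sketch of a client doing the same "ranged GET, read ~44MB, hang up" dance. The host, bucket path and the unauthenticated request are placeholders/assumptions; this is not the actual Spark/S3A code, just an illustration of the pattern:

    // Illustration only: mimic Spark's "ranged GET, then hang up" behavior.
    // ENDPOINT_HOST and the bucket path are placeholders, and a real S3 request
    // would normally be signed (or go through an SDK) - omitted to keep it short.
    const http = require('http');

    const req = http.request({
        host: 'ENDPOINT_HOST',
        port: 80,
        path: '/BUCKET/spark-s100/tpcds/web_returns.dat',
        headers: { Range: 'bytes=881436160-' },   // ask for start..EOF, like the log above
    }, res => {
        let received = 0;
        res.on('data', chunk => {
            received += chunk.length;
            // Spark only wants ~44MB of the range; once it has that it disconnects.
            if (received >= 44071808) {
                res.destroy();   // abrupt client-side close -> 'aborted' on the endpoint
            }
        });
    });
    req.end();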

motorman-ibm commented 3 years ago

I don't know if this is helpful or not. This is a plot of logged times (when reads were logged, when aborts were logged). Since there is no way to associate an abort with its read_object_stream, the vertical pairs of points are not linked to each other; it is likely, however, that each batch of 8 points is linked with the points above it (Spark is issuing 24 jobs to 8 workers at a time). Note this whole job is done in 3-4 seconds. It also appears that Spark is only interested in 44,071,808 bytes per read, even when a read is issued for around 1GB (start=0, end=EOF; entry 16 in the SQL Select plot). That is consistent with the response deltas being flat for each group of 8 requests. So this graph also supports the "read 44MB and disconnect" theory.

It also implies "not all reads were aborted globally at one time"

[image: plot of read and abort log times]
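
As a quick sanity check on those numbers (just arithmetic on the sizes quoted in this comment):

    // A 1,053,529,104-byte object read in ~44,071,808-byte chunks works out to
    // roughly 24 splits, which matches the 24 jobs Spark issues.
    const objectSize = 1053529104;
    const bytesPerRead = 44071808;
    console.log(objectSize / bytesPerRead);   // ~23.9 -> 24 reads, the last one short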

akmithal commented 2 years ago

Hi @nimrod-becker, this defect was never screened. Can we have it screened at least to know its impact? Our test environments aren't using a Spark workload, so such issues (if still present) would only be seen directly at the customer end.

motorman-ibm commented 2 years ago

I think this should be deferred. I think the underlying issue was AFM compatibility. If the object is busy, it will mess up AFM eviction, so this should be looked at when AFM is supported

akmithal commented 2 years ago

Hi @romayalon, please mark it for Enhancements. Thanks

github-actions[bot] commented 2 months ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.

github-actions[bot] commented 1 month ago

This issue is stale and had no activity for too long - it will now be closed.