gulpjs / glob-stream

Readable streamx interface over anymatch.
MIT License
178 stars 51 forks source link

Lists only 16 or less files when a glob string including no globstar. #112

Closed sttk closed 1 year ago

sttk commented 2 years ago

What were you expecting to happen?

Lists more than 16 files in a directory.

What actually happened?

Lists only 16 files with a glob including no globstar.

Please give us a sample

$ npm init -y
$ npm install glob-stream
$ mkdir in out
$ for n in $(seq 100); do echo "hello" > in/test_${n}.txt; done
$ cat <<EOF > test.js
const { pipeline, Transform, Writable } = require('stream');                    
const GlobStream = require('glob-stream');
const fs = require('fs');
const gs = GlobStream('./in/*.txt');
let count = 0;
const ts = Transform({
  objectMode: true,
  transform: (f, enc, cb) => { ++count; cb(null, f.path + '\n'); },
});
const ws = fs.createWriteStream('./out/out.txt');
pipeline(gs, ts, ws, () => console.log('count =', count));
EOF
$

Terminal output / screenshots

$ node test.js     
count = 16
$ wc -l out/out.txt
      16 out/out.txt

Please provide the following information:

Additional information

When replaces the above glob string to './in/**/*.txt', the result was as follows:

$ node test.js     
count = 100
$ wc -l out/out.txt
     100 out/out.txt
phated commented 2 years ago

@sttk if this is the full example, you've forgotten to create a Writeable stream, so it never sinks and you hit the highWaterMark

phated commented 2 years ago

Nvm, I see the writeable stream is fs

phated commented 2 years ago

However, I'm very confused because I was 99% sure that you aren't allowed to use an object stream with fs writeStream. It's a text based stream.

sttk commented 2 years ago

@phated This cause is that glob process multiple readdir only when a given glob string include globstar.

https://github.com/isaacs/node-glob/blob/3bfec21dd180ddf4672880176ad33af6296a167e/glob.js#L360

sttk commented 2 years ago

@phated

you aren't allowed to use an object stream with fs writeStream

In this Transform, I changed an object to a string. In the above sample code, the Transform convert an object to a string.

const ts = Transform({
  objectMode: true,
  transform: (f, enc, cb) => { ++count; cb(null, f.path + '\n'); },
});
sttk commented 2 years ago

The following is the logs of isGlobStar here by the above sample code when a glob string is './in/*.txt':

$ node test.js
glob isGlobStar false
count = 16
$

And the logs when a glob string is './in/**/*.txt':

$ node test.js
glob isGlobStar true
glob isGlobStar false
glob isGlobStar false
glob isGlobStar true
glob isGlobStar false
...
glob isGlobStar false
glob isGlobStar true
count = 100
$
sttk commented 2 years ago

I noticed this issue when I run the test of vinyl-fs. This test failed.

sttk commented 2 years ago

https://github.com/isaacs/node-glob/blob/3bfec21dd180ddf4672880176ad33af6296a167e/glob.js#L412-L432

To skip this part lists all files even if a glob string includes no globstar.

sttk commented 2 years ago

The following code which uses GlobStream in readable.js directly works well.

$ cat <<EOF > test2.js
const { pipeline, Transform } = require('stream');
const fs = require('fs');
const GlobStream = require('glob-stream/readable');
const gs = GlobStream('./in/*.txt', []);
let count = 0;
const ts = Transform({
  objectMode: true,
  transform: (f, enc, cb) => { ++count; cb(null, f.path + '\n'); },
});
const ws = fs.createWriteStream('./out/out.txt');
pipeline(gs, ts, ws, () => console.log('count =', count));
EOF
$ node test2.js
count = 100

The following code which also uses pumpify and unique-stream like index.js works well.

$ cat <<EOF > test3.js
const { pipeline, Transform } = require('stream');
const fs = require('fs');
const pumpify = require('pumpify');
const GlobStream = require('glob-stream/readable');
const UniqueStream = require('unique-stream');
const gs = GlobStream('./in/*.txt', []);
const uniq = UniqueStream();
const pump = pumpify.obj(gs, uniq);
let count = 0;
const ts = Transform({
  objectMode: true,
  transform: (f, enc, cb) => { ++count; cb(null, f.path + '\n'); },
});
const ws = fs.createWriteStream('./out/out.txt');
pipeline(pump, ts, ws, () => console.log('count =', count));
EOF
$ node test3.js
count = 100

And the following code which also uses ordered-read-stream like index.js doesn't work well.

$ cat <<EOF > test4.js
const { pipeline, Transform } = require('stream');                              
const fs = require('fs');
const pumpify = require('pumpify');
const GlobStream = require('glob-stream/readable');
const UniqueStream = require('unique-stream');
const Combine = require('ordered-read-streams');
const gs = GlobStream('./in/*.txt', []);
const agg = new Combine(gs);
const uniq = UniqueStream();
const pump = pumpify.obj(agg, uniq);
let count = 0;
const ts = Transform({
  objectMode: true,
  transform: (f, enc, cb) => { ++count; cb(null, f.path + '\n'); },
});
const ws = fs.createWriteStream('./out/out.txt');
pipeline(pump, ts, ws, () => console.log('count =', count));
EOF
$ node test4.js 
count = 16
sttk commented 2 years ago

@phated I suspect that this issue is due to ordered-read-streams from the above results.

sttk commented 2 years ago

This issue is not due to ordered-read-streams. The cause is seemed that glob-stream/readable uses highWaterMark as limit of read count.

$ cat <<EOF > test3.js
const fs = require('fs');
const { Readable, pipeline } = require('stream');
const Combine = require('ordered-read-streams');
let cnt = 1;
let max = 100;
let read_count = 0;
const rs = new Readable({
  highWaterMark: 20,
  objectMode: true,
  read(size) {
    read_count++;
    for (i = 0; i < size; i++, cnt++) {
      if (cnt > max) {
        this.push(null);
        break;
      }
      this.push(`${cnt}\n`);
    }
  },
});
const cs = new Combine([rs]);
let drain_count = 0;
const ws = fs.createWriteStream('out.txt', { highWaterMark: 10 });
ws.on('drain', () => {
  drain_count++;
});
ws.on('finish', () => {
  console.log(`read_count = ${read_count}`);
  console.log(`drain_count = ${drain_count}`);
});
pipeline(cs, ws); 
EOF
$ node test5.js 
read_count = 6
drain_count = 24
$ wc -l out.txt
     100 out.txt
$
phated commented 2 years ago

The cause is seemed that glob-stream/readable uses highWaterMark as limit of read count.

That means you aren't draining the streams correctly.

btakita commented 2 years ago

I'm having a similar issue when trying to create a simple writable stream with a listener to the 'data' event. Only 16 results are read before the process completes. Could you provide a simple example where a large directory is globbed, console.info is called for each data event while listening to all of the files that should be returned by the glob?

Here is an example where only the first 16 data events are called.

// 16 results
glob_stream(join(process.cwd(), '*'))
  .on('data', data=>console.info(data))
  .on('end', ()=>console.info('done'))

// 16 results
glob_stream(join(process.cwd(), '*')).pipe(new Writable({
  objectMode: true,
  write(chunk, _encoding, next) {
    console.debug(chunk)
    next()
  }
}))

I also tried using the mississippi library with .pipe, as you do in your tests & cannot figure out how to have the entire glob complete.

Even piping to process.stdout throws an error since this library streams objects. This seems very basic, but I can't find any examples of consuming streaming objects properly for any library. Hopefully this thread to serve posterity by resulting with a simple example of properly consuming an object stream.

// TypeError [ERR_INVALID_ARG_TYPE]: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object
glob_stream(join(dir_path, '*')).pipe(process.stdout)
btakita commented 2 years ago

Note that spawn & readline work as expected.

import { spawn } from 'child_process'
import * as readline from 'readline'
// All of the files in `dir_path` are printed to the console
readline.createInterface({ input: spawn('ls', ['-1'], { cwd: dir_path }).stdout })
    .on('line', $=>console.debug($))
phated commented 1 year ago

I've added 2 tests for this at https://github.com/gulpjs/glob-stream/pull/119/commits/790c0cb5f1cd81ace1024b2cf0d41756c875db4c