
BIG Data Read and Write | TB #2

Closed jeevan-vaishnav closed 1 week ago

jeevan-vaishnav commented 1 week ago

The provided code is designed to handle reading and writing large files efficiently using streams. It uses the fs module in Node.js to manage file handling. Here's a point-by-point breakdown of how the code works and how it lets you process very large data sets (even terabytes) efficiently.

Breakdown of the Code

Immediately Invoked Async Function:

```javascript
// Promise-based fs API (assumed import; the original snippet does not show it)
const fs = require("node:fs/promises");

(async () => {
  // ...
})();
```

This is an immediately invoked async function expression. It ensures that the code inside runs right away and allows the use of await for handling promises.

Opening Files:

```javascript
const fileHandleRead = await fs.open("big-gigantic.txt", "r");
const fileHandleWrite = await fs.open("dest.txt", "w");
```

fileHandleRead: opens big-gigantic.txt for reading. This is the file containing the large amount of data (it could be terabytes in size).
fileHandleWrite: opens dest.txt for writing. This is where the chunks that are read will be written.

Creating Streams:

```javascript
const streamRead = fileHandleRead.createReadStream({ highWaterMark: 64 * 1024 });
const streamWrite = fileHandleWrite.createWriteStream();
```

streamRead: a readable stream created with a highWaterMark of 64 * 1024 bytes (64 KB), so data is read in chunks of up to 64 KB at a time. The highWaterMark controls how much data is buffered in memory at once.
streamWrite: a writable stream for writing to dest.txt. It will receive the chunks read from streamRead.

Reading Data in Chunks:

```javascript
streamRead.on("data", (chunk) => { /* ... */ });
```

This event listener fires each time streamRead reads a chunk of data. Each chunk is processed inside this callback.

Writing Data to the Destination File:

```javascript
if (!streamWrite.write(chunk)) {
  console.log("streamRead.Pause");
  streamRead.pause();
}
```

The chunk read from streamRead is written to the streamWrite writable stream. write() returns false when the internal buffer is full and cannot accept more data for the moment. When that happens, reading is paused with streamRead.pause(), which prevents unbounded buffering in memory and keeps large transfers manageable.

Resuming the Stream When the Buffer Drains:

```javascript
streamWrite.on("drain", () => {
  console.log("Drained");
  streamRead.resume();
});
```

The drain event is emitted when streamWrite's buffer has been emptied, meaning it is ready to accept more data. At that point reading is resumed with streamRead.resume(). This keeps the flow of data balanced between reading and writing and prevents memory overload.

How This Solves Large Data Handling (TB size)

Efficient Memory Usage: the code reads the file in chunks (64 KB at a time) rather than loading the entire file into memory. This is crucial when dealing with very large files, such as terabytes of data. With streams, only a small part of the file is in memory at any given moment.
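To put a rough, purely illustrative number on that claim, using the 64 KB highWaterMark from the snippet above:

```javascript
// Back-of-the-envelope arithmetic (illustrative assumption: a 1 TiB source file)
const chunkSize = 64 * 1024;        // 64 KiB highWaterMark
const fileSize = 1024 ** 4;         // 1 TiB
console.log(fileSize / chunkSize);  // 16777216 chunks in total
// At any given moment only about one 64 KiB chunk (plus the writable
// stream's internal buffer) is resident in memory, not the whole file.
```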

Flow Control with pause() and resume():

The code pauses reading when the writable stream is not ready to accept more data. This flow control prevents memory overflow by balancing the rate of reading with the rate of writing. When the writable stream's buffer is cleared, reading resumes, so the reading and writing sides stay synchronized and efficient. (Node can also handle this handshake for you; see the sketch after these points.)

No Need for Temporary Storage: since the code processes the data as it reads it, there is no need to store intermediate data in temporary files or buffers, making the process highly efficient for large data transfers.

Scalability for TB Files: The streaming and flow control mechanism allows this code to handle very large files (e.g., terabytes) without exhausting memory. Only small chunks of data are processed at a time.
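As a point of comparison (not part of the code discussed in this issue): Node's built-in piping applies the same backpressure handling automatically, pausing and resuming the readable side for you. A minimal sketch, assuming the same file names and chunk size as above:

```javascript
const fs = require("node:fs/promises");
const { pipeline } = require("node:stream/promises");

(async () => {
  const fileHandleRead = await fs.open("big-gigantic.txt", "r");
  const fileHandleWrite = await fs.open("dest.txt", "w");

  // pipeline() wires the streams together and performs the
  // write()/drain/pause()/resume() dance internally.
  await pipeline(
    fileHandleRead.createReadStream({ highWaterMark: 64 * 1024 }),
    fileHandleWrite.createWriteStream()
  );
})();
```

The manual version is still worth understanding because it makes the backpressure mechanism explicit; this is essentially what pipe() and pipeline() do under the hood.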

In Summary

This code efficiently processes large files using Node.js streams by reading and writing data in manageable chunks. The pause() and resume() methods control the flow to prevent memory issues, making it suitable for handling huge datasets without loading them entirely into memory.
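For reference, here is the full flow assembled from the snippets above into one runnable sketch. The require, the "end" handler, and streamWrite.end() are assumptions added so the copy finishes cleanly; the original excerpt does not show them.

```javascript
const fs = require("node:fs/promises");

(async () => {
  const fileHandleRead = await fs.open("big-gigantic.txt", "r");
  const fileHandleWrite = await fs.open("dest.txt", "w");

  const streamRead = fileHandleRead.createReadStream({ highWaterMark: 64 * 1024 });
  const streamWrite = fileHandleWrite.createWriteStream();

  streamRead.on("data", (chunk) => {
    // write() returns false when the writable's buffer is full:
    // pause reading until it drains.
    if (!streamWrite.write(chunk)) {
      console.log("streamRead.Pause");
      streamRead.pause();
    }
  });

  streamWrite.on("drain", () => {
    console.log("Drained");
    streamRead.resume();
  });

  streamRead.on("end", () => {
    // Flush whatever is still buffered and close the destination.
    streamWrite.end();
  });
})();
```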