loh-tar / cpd

A pure bash script to collect copy jobs and start them only when the target drive is not busy with any other job supervised by cpd
GNU General Public License v2.0

Removing unneeded parts #8

Closed loh-tar closed 6 years ago

loh-tar commented 6 years ago

Your advice is appreciated. I always intended to support two modes:

1) Tracking each file with progress info

2) Some bulk mode where the call is done as it would be natively on the command line, with the thought that it may perform better. Right now I haven't really got it working. It is now a series of calls via xargs.

The question is: is it OK to have only my tracking mode 1?

BTW: runTest is a little broken. No idea how that happened. This way the new check seems to work:

if ! ./cpd -D | grep -q "Simulation mode"; then
Ambrevar commented 6 years ago

I don't understand point 2. What would perform better? What is a "bulk-mode"?

As we mentioned in previous issues, I think the only way out of this tangled knot of issues is to operate only on files, not folders. Leave the folder traversal to tools like find and pass the result to cpd via a file list (file/stdin, possibly formatted in JSON).

loh-tar commented 6 years ago

My source currently looks like this:

  # For job $1: take the file names (field 2 onward) from the job list,
  # NUL-separate them, and let xargs run cp over the whole batch
  cut -f2- "$tmpDir/job-files-$1" | tr '\n' '\000' | xargs -0   \
    cp ${jobData["$1:CPYOPT"]} "${jobData["$1:TARGET"]}"        \
      > "$tmpDir/msg.log-job-$1" 2>"$tmpDir/error.log-job-$1"

I may be wrong, but I think that if we have e.g. 100 source files, cp will be called 100 times. By "bulk" I mean there should be only one call with all 100 files listed as arguments.

My problem is that I'm more of a noob than a guru and always have trouble understanding or solving even the simplest things. It would be nice to receive a patch from someone to solve this issue. There should be the possibility to freely position where the <list-of-files> is inserted in the cp call.

Regarding -r/-m, you will see in my other post that I plan a modification (ignored with -o), but I'd like to keep them.

I'm not a big fan of "new fancy" things like JSON. Newlines in filenames are, from my point of view, stupid, and I feel fine keeping this bug. At least until version 1.0 I will not make use of JSON.

Ambrevar commented 6 years ago

The point of xargs is precisely to execute the command once over the list of args passed from stdin. So this is totally fine (save for a bash pitfall...).
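
For what it's worth, this is easy to check with a throwaway command in place of cp; the file names below are made up:

```bash
# Throwaway demonstration (file names made up): xargs packs the whole list
# into as few invocations as the argument-length limit allows, here just one.
printf '%s\0' file-a file-b file-c \
  | xargs -0 sh -c 'echo "one invocation for $# files: $*"' _
```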

There should be the possibility to freely position where the <list-of-files> is inserted in the cp call.

What do you mean?

I'm not a big fan of "new fancy" things like JSON. Newlines in filenames are, from my point of view, stupid, and I feel fine keeping this bug.

Non-support for newlines is an absolute no-go for any system tool. Filenames could be generated, and a newline could be inserted by mistake, or maybe filenames have to conform to the same formatting as some title-subtitle with a newline: you never know.
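
For illustration (a throwaway example, unrelated to cpd's code), such a name is trivial to create and immediately confuses newline-based parsing:

```bash
# A newline is a legal byte in a file name (only '/' and NUL are not),
# and it breaks newline-based parsing.
mkdir /tmp/demo && cd /tmp/demo
touch $'title\nsubtitle'
ls | wc -l    # reports 2 "lines" for this single file
```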

JSON is neither new nor fancy: it is in fact the simplest serialization format you could think of: http://json.org/. It is widely supported, widely available, simple and efficient.

That being said, you don't need to use it. You could make your list \0 separated. Most find implementations can separate the output with \0 thanks to the -print0 option. pt has a -0 option, and I think ag does too.

If you want to support both, do like the rest of the tools: assume \n separated by default, and \0 separated with the -0 option.
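
For reference, a minimal sketch of consuming a \0-separated list in bash; the directory path is just a placeholder:

```bash
# Read a \0-separated list, e.g. produced by `find ... -print0`, one entry at a time.
while IFS= read -r -d '' file; do
  printf 'would copy: %s\n' "$file"
done < <(find /some/source/dir -type f -print0)
```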

Ambrevar commented 6 years ago

That being said, how do you intend to track the progress of all the copies? Running one cp per job will only allow the user to know how many jobs are left, but not how far along the job is. Running one cp after another would allow the user to keep track of how many files are left per job.

Another nice property of having all files processed one after another is that when some jobs process files split across different devices, it allows the device-based scheduler to schedule the individual file copies within those jobs.

For instance:

job1 has /mnt/a/foo, /mnt/b/bar -> /mnt/dest1
job2 has /mnt/a/qux, /mnt/b/quux -> /mnt/dest2

Then cpd can copy foo and quux in parallel, and qux and bar in parallel.
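
A hypothetical sketch of how such device-based grouping could be derived, assuming GNU stat and reusing the example paths above; this is not cpd's actual scheduler code:

```bash
# Group files by the device they live on, so one copy per device can run in parallel.
declare -A byDevice
for f in /mnt/a/foo /mnt/b/bar /mnt/a/qux /mnt/b/quux; do
  dev=$(stat -c %d -- "$f")     # %d = device number the file resides on
  byDevice[$dev]+="$f"$'\n'
done
# byDevice now holds one newline-separated file list per source device
```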

loh-tar commented 6 years ago

The point of xargs is precisely to execute the command once over the list of args passed from stdin.

Hmm, ok. Somehow I got a different impression while fumbling with that.

There should be the possibility to freely position where the <list-of-files> is inserted in the cp call. What do you mean?

Well, as far as I can see, xargs adds the list of args to the end of the command line. While thinking a little about the new needs regarding support of the typical cp syntax, it seemed necessary to put this list somewhere other than the end, but I can't remember why... Probably a false thought.

Or was it this: I would like to execute two commands at the same time with the same list, the cp call and some logging.
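
If that is the goal, one possible shape (just a sketch, not cpd's current code, and with CPYOPT handling left out) is to wrap both commands in a small sh -c script so they share the same argument list:

```bash
# cp and the log write see the same batch of file names; $0 carries the target.
cut -f2- "$tmpDir/job-files-$1" | tr '\n' '\000' \
  | xargs -0 sh -c 'cp -- "$@" "$0" && printf "copied: %s\n" "$@"' \
      "${jobData["$1:TARGET"]}" >> "$tmpDir/msg.log-job-$1"
```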

Filenames could be generated, and a newline could be inserted by mistake, or maybe filenames have to conform to the same formatting as some title-subtitle with a newline: you never know.

Maybe. But I still think this should be forbidden by POSIX, and mistakes on some other side are their problem.

JSON is neither new nor fancy. It is widely supported, widely available.

sure it is :-)

simple and efficient

Ha, I don't know. Not for my limited mind

You could make your list \0 separated.

Well, while coding this recent change I of course read all this -print0 stuff and the explanations. But I haven't used it because of some processing I would like to do. Perhaps I dismissed it too quickly.

Feel free to take a look at the source, it's probably easy to modify. sort/wc/read have, as far as I remember, options for NUL-terminated input, hmm...
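
For reference, a rough sketch of the NUL-aware counterparts mentioned here, assuming GNU tools; list.nul is a made-up example file of \0-separated names:

```bash
sort -z list.nul > sorted.nul        # GNU sort: -z uses \0 as record separator
tr -cd '\0' < list.nul | wc -c       # count entries; wc itself has no \0 mode
while IFS= read -r -d '' name; do    # bash read: -d '' reads up to the next \0
  printf '%s\n' "$name"
done < sorted.nul
```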

Running one cp per job will only allow the user to know how many jobs are left, but not how far along the job is.

Yes, right now files and bytes are counted and displayed, but not the progress of a single file. That's why I noted the BSD cp feature somewhere.

You didn't notice this? Run some tests with the default setting optNoProgress="0" and watch it with -D.
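
As an aside, a hypothetical sketch (not cpd's code, paths made up, GNU stat assumed) of how per-file progress could be polled without BSD cp, by watching the destination file grow:

```bash
src=/path/to/bigfile
dst=/mnt/dest1/bigfile
total=$(stat -c %s -- "$src")                    # %s = size in bytes
cp -- "$src" "$dst" &
while kill -0 $! 2>/dev/null; do                 # while the cp is still running
  copied=$(stat -c %s -- "$dst" 2>/dev/null || echo 0)
  printf '\r%3d%%' $(( copied * 100 / total ))
  sleep 1
done
wait
printf '\r100%%\n'
```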

job1 has /mnt/a/foo, /mnt/b/bar -> /mnt/dest1
job2 has /mnt/a/qux, /mnt/b/quux -> /mnt/dest2

Then cpd can copy foo and quux in parallel, and qux and bar in parallel.

Ahem??? That's the way it works in general, as long as dest1/dest2 are different drives, not just different volumes.

loh-tar commented 6 years ago

> The point of xargs is precisely to execute the command once over the list of args passed from stdin.

Hmm, ok. Somehow I got a different impression while fumbling with that.

Just one thought regarding performance. How much of a guru are you?

How do the kernel and cp execute such a call with 100 files? Is there, in the end, no or nearly no difference in the execution when I run a loop and call cp with one file after another, vs. all at once via xargs?

My guess is that with small files it could matter, but not with large ones, because they are not copied in parallel. With a small file, copied in milli/microseconds, the loop and all its overhead may be noticeable, but with a bigger file needing some seconds or more, the looping time is not an issue.
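
A throwaway benchmark sketch along those lines (paths made up, GNU cp's -t option assumed) to compare the two approaches:

```bash
mkdir -p /tmp/src /tmp/dst1 /tmp/dst2
for i in $(seq 100); do head -c 4096 /dev/urandom > "/tmp/src/f$i"; done

time for f in /tmp/src/f*; do cp -- "$f" /tmp/dst1/; done          # 100 cp processes
time { printf '%s\0' /tmp/src/f* | xargs -0 cp -t /tmp/dst2 --; }  # one cp process
```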

I had done some rough tests while coding this change and there was no notable difference; that's maybe the reason why I thought something with my xargs was not working as intended.

With these thoughts I had the idea not to have a user switch (optNoProgress) but to decide per job how to copy the files, i.e. to split each job in two parts. I sort the files anyway to filter unintended doubles by size/name. So: run one xargs call with all files smaller than some threshold, and then the rest of the bigger files one by one with nice progress tracking.

Ha, while writing this, the other way around is smarter: the biggest file first. When the execution time drops below one second, copy the rest of the list at once via xargs. Yes, sounds great to me. Thanks for your attention!

Now, how can I read from my file list one by one, and use the rest of it for the xargs call? Tricky. Ideas?
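
One possible shape, purely a sketch with invented names ($jobList, $target) and assuming the list is sorted by size, biggest first, as tab-separated "size<TAB>file" lines; it also relies on the list being a regular file so read leaves the offset right after the line it consumed:

```bash
threshold=$((10 * 1024 * 1024))          # arbitrary cut-off, e.g. 10 MiB
while IFS=$'\t' read -r size file; do
  if (( size >= threshold )); then
    cp -- "$file" "$target"              # big file: copy alone, track its progress
  else
    # first small file reached: re-emit it, let cat append the rest of the
    # list (it continues where read stopped), and hand it all to one xargs call
    { printf '%s\t%s\n' "$size" "$file"; cat; } \
      | cut -f2- | tr '\n' '\000' | xargs -0 cp -t "$target" --
    break
  fi
done < "$jobList"
```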