matloff / partools

Tools to aid coding in the R 'parallel' package.

Defaults for filesplitrand() #12

Open clarkfitzg opened 7 years ago

clarkfitzg commented 7 years ago

Working on updating the documentation and I noticed that we could probably use fname as the default for newbasename here:

filesplitrand(cls,fname,newbasename,ndigs,header=FALSE,sep)

To avoid changing the order of arguments, we'll also need a default for ndigs, perhaps 2?

Opening this now so I will remember to come back to it.
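A minimal sketch of how the proposed defaults would resolve, relying on R's lazy evaluation of default arguments. The function `resolve_split_args` is hypothetical and stands in for the argument handling only; it is not the partools implementation of filesplitrand() and omits cls, header and sep.

```r
# Hypothetical stand-in for the argument handling only, NOT partools code.
# A default may refer to another argument because R evaluates defaults
# lazily, inside the function's own environment.
resolve_split_args <- function(fname, newbasename = fname, ndigs = 2) {
  list(fname = fname, newbasename = newbasename, ndigs = ndigs)
}

# Omitting newbasename and ndigs falls back to the proposed defaults.
args <- resolve_split_args("data.csv")
args$newbasename  # same value as fname
args$ndigs        # 2
```

Because the new defaults sit on existing arguments in their existing positions, calls that already pass all arguments positionally would behave as before under this sketch.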

matloff commented 7 years ago

Both suggestions sound reasonable.

Norm


clarkfitzg commented 7 years ago

Back at it now. Looking at the relevant arguments in "snowdoop" utilities I see:

  nch  Number of chunks for the file split.
  basenm  A chunked file name, minus suffix.
  infile  Name of a nonchunked file.
  ndigs  Number of digits in the chunked file name suffix.
  infilenm  Name of input file (without suffix, if distributed).
  outdfnm  Name of output file (without suffix).
  infiledst  If TRUE, infilenm is distributed.
  usefread  If true, use \code{fread} instead of \code{read.table};
     generally much faster; requires \code{data.table} package.
  header  TRUE if the file chunks have headers.
  seqnums  TRUE if the file chunks will have sequence numbers.
  sep  Field delimiter used in \code{read.table}.
  chunksize  Number of lines to read at a time, for efficient I/O.
  dname  Quoted name of a distributed data frame or matrix.  For
     \code{filesave}, the object must have column names.
  fname  Quoted name of a distributed file.
  fnames  Character vector of file names.
  newbasename  Quoted name of the prefix of a distributed file,
     e.g. \code{xyz} for a distributed file \code{xyz.01}, \code{xyz.02}
     etc.
  inbasename  basename of the input files, e.g. x for x.1, x.2, ...
  outbasename  basename of the output files
  nout  number of output files
  ...  Additional arguments to \code{read.table}, \code{write.table}

How about condensing infile, infilenm, fname, fnames, newbasename, and inbasename into just fname? The infiledst argument, along with ndigs, can be used to handle the appended numbers. If length(fname) > 1, fname can act like fnames.

Also, outbasename and outdfnm could become outfname for consistency with fname.
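To make the idea concrete, here is a rough sketch of how a single fname argument could cover the three cases (one nonchunked file, a distributed file with numbered suffixes, or an explicit vector of names). The helper `expand_fname` is hypothetical and not part of partools; it assumes chunk suffixes are zero-padded to ndigs digits and joined with a dot, as in the xyz.01, xyz.02 example above.

```r
# Hypothetical helper, not part of partools: turn `fname` into the file
# name(s) it stands for, using infiledst and ndigs for the suffixes.
expand_fname <- function(fname, nch, infiledst = FALSE, ndigs = 2) {
  if (length(fname) > 1) return(fname)  # already a vector: act like fnames
  if (!infiledst) return(fname)         # a single nonchunked file
  # Zero-padded chunk numbers, e.g. "01", "02", ... when ndigs = 2
  suffix <- formatC(seq_len(nch), width = ndigs, flag = "0")
  paste(fname, suffix, sep = ".")
}

expand_fname("xyz", nch = 3, infiledst = TRUE)
# "xyz.01" "xyz.02" "xyz.03"
expand_fname("xyz", nch = 3)
# "xyz"
```

Whether the length(fname) > 1 case should bypass infiledst entirely, as it does here, is a design choice this sketch makes for simplicity.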

matloff commented 7 years ago

Clark, the cost/benefit ratio seems high here. Cost here means your time. This is exactly the kind of thing you should be avoiding, in my opinion.

It certainly is true that the argument names are rather jumbled, a natural consequence of adding more and more things over time. But anytime something is changed, we have to worry about "ecological" effects, even with Travis.

To me, this is a "back burner" thing.

Norm


clarkfitzg commented 7 years ago

Fair enough. I'm going to focus on the file sorting then.