quinngroup / dr1dl-pyspark

Dictionary Learning in PySpark
Apache License 2.0

starting with spark #42

Closed: MOJTABAFA closed this issue 8 years ago

MOJTABAFA commented 8 years ago

@magsol On starting with Spark: I'm sorry if my questions are too simple; I'm a complete beginner with Spark, so I need your support too. We can start by importing a text file and making an RDD with a command like the following:

S = sc.textFile('../../file_s.txt')

Am I right? Or is it necessary to use sc.parallelize() at the beginning?

magsol commented 8 years ago

No worries; I'm working constantly on getting my laptop cleaned out and running again, so I appreciate that you're taking the initiative anyway. I'll work on the code with you as soon as my laptop is ready (which will hopefully be tomorrow).

As for reading in the data: assuming the input format of the data is the same (cc @LindberghLi), yes, you'll use sc.textFile. That call already returns a distributed RDD from the SparkContext object (sc), so you don't need to call parallelize yourself.

In fact, textFile is what's called a "lazy" operation, meaning Spark won't actually read the text file until you perform an action on the RDD (such as counting, collecting, or printing its contents).
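
To make that concrete, here's a minimal sketch; the file path, the whitespace-delimited numeric format, and the SparkContext setup are just assumptions for illustration, not project code:

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf = SparkConf())

# Lazy: this only records how to build the RDD; nothing is read from disk yet.
S = sc.textFile('../../file_s.txt')

# Transformations such as map are lazy as well.
rows = S.map(lambda line: [float(x) for x in line.strip().split()])

# Only an action (count, collect, first, ...) forces Spark to read the file
# and run the whole pipeline.
print(rows.count())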

MOJTABAFA commented 8 years ago

Can I change R1DL_Spark.py a little, as follows?

import argparse
from pyspark import SparkContext, SparkConf

def main(sc):
    pass

if __name__ == "__main__":

    # Set up the arguments here.
    parser = argparse.ArgumentParser(description = 'PySpark Dictionary Learning',
        add_help = 'How to use', prog = 'python pyspark-example.py <args>')
    parser.add_argument("-i", "--input", required = True,
        help="Input File name.(file_s)")
    parser.add_argument("-d", "--dictionary", required = True,
        help="Dictionary File name.(file_D)")
    parser.add_argument("-o", "--output", required = True,
        help="Output File name.(file_Z)")
    parser.add_argument("-n", "--pnonzero", type = float, required = True,
        help="Percentage of Non-zero elements.")
    parser.add_argument("-m", "--mDicatom", type = int, required = True,
        help="Number of the dictionary atoms.")
    parser.add_argument("-e", "--epsilon", type = float, required = True,
        help="The value of epsilon.")

    args = vars(parser.parse_args())

    # Initialize the SparkContext. This is where you can create RDDs,
    # the Spark abstraction for distributed data sets.
    sc = SparkContext(conf = SparkConf())

magsol commented 8 years ago

No, let's leave it as is for now.

XiangLi-Shaun commented 8 years ago

As @MOJTABAFA requested, I have put the transpose of the 700 MB test file at:

http://hafni.cs.uga.edu/test_T.txt

The "4.5 million" fMRI data is right now distributed as 68 individual files, I can merge them as a single large file if needed, or we can write a small interface to load them in batch.

MOJTABAFA commented 8 years ago

Thanks Xiang, I'll test it and let you know soon.

magsol commented 8 years ago

Actually, Spark's textFile() method will work on a directory of text files; it reads every file in the directory.
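
For example, a quick sketch; the paths are placeholders, and an existing SparkContext sc is assumed:

# A single file:
single = sc.textFile('/data/test_T.txt')

# A whole directory: every text file inside is read into one RDD, so the 68
# per-subject files would not have to be merged by hand.
combined = sc.textFile('/data/fmri_parts/')

# Globs and comma-separated lists of paths also work:
#   sc.textFile('/data/fmri_parts/*.txt')
#   sc.textFile('/data/part01.txt,/data/part02.txt')

print(combined.getNumPartitions())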

iPhone'd
