国家基因组科学数据中心 GSA数据下载、Linux上游分析实战（二）

国家基因组科学数据中心 GSA数据下载、Linux上游分析实战（二） by 生信小博士

1.初步质控看看结果如何

conda activate rna2mkdir -p fastqc_outputnohup fastqc -t 6 -o fastqc_output CRR*/*.fq.gz &

2.使用multiqc合并质控结果

multiqc ./
multiqc的主要参数--force            -f  Overwrite any existing reports                                                 ││ --config           -c  Specific config file to load, after those in MultiQC dir / home dir / working  ││                        dir.                                                                           ││                        (PATH)                                                                         ││ --cl-config            Specify MultiQC config YAML on the command line (TEXT)                         ││ --filename         -n  Report filename. Use 'stdout' to print to standard out. (TEXT)                 ││ --outdir           -o  Create report in the specified output directory. (TEXT)                        ││ --ignore           -x  Ignore analysis files (GLOB EXPRESSION)                                        ││ --ignore-samples       Ignore sample names (GLOB EXPRESSION)                                          ││ --ignore-symlinks      Ignore symlinked directories and files                                         ││ --file-list        -l  Supply a file containing a list of file paths to be searched, one per row

结果如下：

使用浏览器打开看看

从质控图上看，测序质量挺好，接头含量也没超标

3.使用trimgalory进行质量控制(对于本数据可以忽略这一步)

ls | grep '^CRR' >sample.id #批量获取样本名称find ../CRA005925/ -mindepth 1 -type d|cut -d"/" -f3 >sample.id 
conda activate rna2
#!/bin/bashcat sample.id | while read iddo
mkdir ~/silicosis_wangchen/data/02_trimgalory/${id}echo "${id}"
echo "++++++++===========================+++++++++++"

trim_galore --phred33 -q 20 --length 36    \  --stringency 3 --fastqc --paired --max_n 3 \  -o  ~/silicosis_wangchen/data/02_trimgalory/${id}                          \  ~/silicosis_wangchen/data/CRA005925/${id}/${id}_f1.fq.gz \  ~/silicosis_wangchen/data/CRA005925/${id}/${id}_f2.fq.gzdone
chmod +x batch_trimgalore.shnohup  bash  batch_trimgalore.sh   >trim_galore.log  2>&1  &

trimgalory 详细参数
-q/--quality <INT>      Trim low-quality ends from reads in addition to adapter removal. For                        RRBS samples, quality trimming will be performed first, and adapter                        trimming is carried in a second round. Other files are quality and adapter                        trimmed in a single pass. The algorithm is the same as the one used by BWA                        (Subtract INT from all qualities; compute partial sums from all indices                        to the end of the sequence; cut sequence at the index at which the sum is                        minimal). Default Phred score: 20.
--phred33               Instructs Cutadapt to use ASCII+33 quality scores as Phred scores                        (Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.
--phred64               Instructs Cutadapt to use ASCII+64 quality scores as Phred scores                        (Illumina 1.5 encoding) for quality trimming.
--fastqc                Run FastQC in the default mode on the FastQ file once trimming is complete.
--fastqc_args "<ARGS>"  Passes extra arguments to FastQC. If more than one argument is to be passed                        to FastQC they must be in the form "arg1 arg2 etc.". An example would be:                        --fastqc_args "--nogroup --outdir /home/". Passing extra arguments will                        automatically invoke FastQC, so --fastqc does not have to be specified                        separately.
-a/--adapter <STRING>   Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will                        try to auto-detect whether the Illumina universal, Nextera transposase or Illumina                        small RNA adapter sequence was used. Also see '--illumina', '--nextera' and                        '--small_rna'. If no adapter can be detected within the first 1 million sequences                        of the first file specified or if there is a tie between several adapter sequences,                        Trim Galore defaults to '--illumina' (as long as the Illumina adapter was one of the                        options, else '--nextera' is the default). A single base                        may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA.
                        At a special request, multiple adapters can also be specified like so:                        -a  " AGCTCCCG -a TTTCATTATAT -a TTTATTCGGATTTAT"                        -a2 " AGCTAGCG -a TCTCTTATAT -a TTTCGGATTTAT", or so:                        -a "file:../multiple_adapters.fa"                        -a2 "file:../different_adapters.fa"                        Potentially in conjucntion with the parameter "-n 3" to trim all adapters. Please note                        that this is NOT needed for standard trimming!                         More Information here: https://github.com/FelixKrueger/TrimGalore/issues/86
--max_length <INT>      Discard reads that are longer than <INT> bp after trimming. This is only advised for                        smallRNA sequencing to remove non-small RNA sequences.

--stringency <INT>      Overlap with adapter sequence required to trim a sequence. Defaults to a                        very stringent setting of 1, i.e. even a single bp of overlapping sequence                        will be trimmed off from the 3' end of any read.

--length <INT>          Discard reads that became shorter than length INT because of either                        quality or adapter trimming. A value of '0' effectively disables                        this behaviour. Default: 20 bp.
                        For paired-end files, both reads of a read-pair need to be longer than                        <INT> bp to be printed out to validated paired-end files (see option --paired).                        If only one read became too short there is the possibility of keeping such                        unpaired single-end reads (see --retain_unpaired). Default pair-cutoff: 20 bp.
--max_n COUNT           The total number of Ns a read may contain before it will be removed altogether.                        In a paired-end setting, either read exceeding this limit will result in the entire                        pair being removed from the trimmed output files. If COUNT is a number between 0 and 1,                        it is interpreted as a fraction of the read length.

在Linux中，你可以使用ls命令与通配符来获取当前目录下所有以"CRR"开头的文件名称。具体命令如下：


ls CRR*
此命令会列出当前目录中所有以"CRR"开头的文件和目录。如果你只想列出文件名，可以结合grep和ls命令来过滤：

ls | grep '^CRR'
此命令会列出当前目录中所有以"CRR"开头的文件和目录名称。如果你只想要文件而不包含目录，可以使用find命令：find . -maxdepth 1 -type f -name 'CRR*'
此命令会列出当前目录（不包括子目录）中所有以"CRR"开头的文件。



4.Hista2比对
cat >map.sh
#!/bin/bash
#获取样本名称，其实就是每个fastq文件的前缀ls ../CRA005925/ |grep '^CRR' |while read i
do  echo ${i}  echo "==========*****************======="  hisat2 -x ~/20240125_rnaseq/data/4086673388_YoungLeeyyL/_/index/grcm38/genome -p 20 --summary-file ./${i}.mapinfo.txt \            -1 ~/silicosis_wangchen/data/CRA005925/${i}/${i}_f1.fq.gz \            -2 ~/silicosis_wangchen/data/CRA005925/${i}/${i}_r2.fq.gz \            |samtools sort -@ 20 -o ${i}.sorted.bamdone
chmod +x map.shnohup  bash  map.sh  >map.log  2>&1  &

重要参数
 --phred33          qualities are Phred+33 (default) --phred64          qualities are Phred+64
 -p/--threads <int> number of alignment threads to launch (1)
Usage:   hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).  <m1>       Files with #1 mates, paired with files in <m2>.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).  <m2>       Files with #2 mates, paired with files in <m1>.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).  <r>        Files with unpaired reads.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).  <sam>      File for SAM output (default: stdout)
  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

5.featurecounts定量

cat  >count.sh #!/bin/bashfeatureCounts -a ~/20240125_rnaseq/data/4086673388_YoungLeeyyL/_/index/Mus_musculus.GRCm38.102.gtf  \--extraAttributes gene_name,gene_biotype -T 60 \-t exon -p -o ./all_samples_counts.txt  ~/silicosis_wangchen/data/03map-data/*.bam

重要参数：Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 
## Mandatory arguments:
  -a <string>         Name of an annotation file. GTF/GFF format by default. See                      -F option for more format information. Inbuilt annotations                      (SAF format) is available in 'annotation' directory of the                      package. Gzipped file is also accepted.
  -o <string>         Name of output file including read counts. A separate file                      including summary statistics of counting results is also                      included in the output ('<string>.summary'). Both files                      are in tab delimited format.
  input_file1 [input_file2] ...   A list of SAM or BAM format files. They can be                      either name or location sorted. If no files provided,                      <stdin> input is expected. Location-sorted paired-end reads are automatically sorted by read names.
-T <int>            Number of the threads. 1 by default.
# Parameters specific to paired end reads
  -p                  If specified, libraries are assumed to contain paired-end                      reads. For any library that contains paired-end reads, the                      'countReadPairs' parameter controls if read pairs or reads                      should be counted.

不可查看链接：马拉松课程https://mp.csdn.net/mp_blog/creation/editor/135890652
jimmy2022https://mp.csdn.net/mp_blog/creation/editor/128177352

ixxmu / mp_duty

国家基因组科学数据中心 GSA数据下载、Linux上游分析实战（二） #5169

国家基因组科学数据中心 GSA数据下载、Linux上游分析实战（二） by 生信小博士